Hadoop & Big Data
Hadoop
Hadoop was first developed on the basis of two papers by Google engineers: "The Google File System" (the basis for HDFS) and "MapReduce: Simplified Data Processing on Large Clusters" (the basis for Hadoop MapReduce).
The Apache Hadoop software library is an open source framework that allows for the distributed
processing of large data sets across clusters of computers using simple programming models.
It is designed to scale up from single servers to thousands of machines, each offering local computation
and storage.
Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.
The Hadoop ecosystem contains many tools that use Hadoop as their foundation:
Hive, Pig, Spark, HBase, Sqoop, Kafka, Flume, Oozie, ZooKeeper, etc.
HDFS
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity
hardware. It has many similarities with existing distributed file systems. However, the differences from
other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed
on low-cost hardware. Features of HDFS are:
1. Distributed – it stores a large file (TBs or PBs) not on a single machine but in chunks spread across several machines. This makes it distributed.
2. Scalable – HDFS is deployed on low-cost commodity hardware machines, which can be added easily, with all machines connected through a network. This is called horizontal scaling.
3. Cost-effective – since it uses low-cost commodity hardware rather than specially built server machines, scaling it is cost-effective.
4. Fault-tolerant – HDFS is fault-tolerant; any machine in its network can go down at any time. HDFS has quick fault-detection mechanisms as well as auto recovery.
5. High throughput – HDFS has high throughput, i.e. the number of records processed per unit of time is high. It focuses on throughput rather than latency (the time to get the first record), so HDFS is not a good choice for low-latency requirements.
HDFS Architecture
[Figure: sample Hadoop cluster layout – a core switch connecting rack switches, each rack being a group of machines]
Above is a sample picture of a Hadoop cluster. Each Hadoop cluster has a Hadoop client through which users interact with the name node and data nodes.
1. The Hadoop client interacts with the cluster through a core switch, which is connected to several rack switches in the network. A rack can be thought of as a collection of computers connected together. A Hadoop cluster can have multiple racks connected via the network.
2. Hadoop follows a master-slave architecture in which one node or computer is assigned as the master (name node), which manages the File System Namespace and controls access to files by clients.
Functions of NameNode:
It is the master daemon that maintains and manages the DataNodes (slave nodes).
It records the metadata of all the files stored in the cluster, e.g. the location of stored blocks, the size of the files, permissions, hierarchy, etc. There are two files associated with the metadata:
o FsImage: It contains the complete state of the file system namespace since the start of
the NameNode.
o EditLogs: It contains all the recent modifications made to the file system with respect to
the most recent FsImage.
It records each change that takes place to the file system metadata. For example, if a file is
deleted in HDFS, the NameNode will immediately record this in the EditLog.
It regularly receives a Heartbeat and a block report from all the DataNodes in the cluster to
ensure that the DataNodes are live.
It keeps a record of all the blocks in HDFS and in which nodes these blocks are located.
HDFS writes data in blocks of 128 MB using FSDataOutputStream. The client first buffers data locally, and when the buffer reaches the block size it sends a block-allocation request to the name node. The name node checks its metadata about the data nodes and allocates a block on one of the available data nodes. A minimal sketch of this write path through the HDFS Java API is shown below.
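A minimal sketch of writing a file through the HDFS FileSystem API from Scala, assuming a reachable NameNode configured via fs.defaultFS; the path, file contents and NameNode address are hypothetical:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FSDataOutputStream, FileSystem, Path}

object HdfsWriteSketch {
  def main(args: Array[String]): Unit = {
    // Assumes fs.defaultFS in core-site.xml points at the NameNode,
    // e.g. hdfs://namenode-host:8020 (hypothetical address).
    val conf = new Configuration()
    val fs: FileSystem = FileSystem.get(conf)

    // create() asks the NameNode for block allocation; the returned
    // FSDataOutputStream buffers data locally and ships full blocks
    // (128 MB by default, dfs.blocksize) to the chosen DataNodes.
    val out: FSDataOutputStream = fs.create(new Path("/tmp/example.txt"))
    out.write("hello hdfs\n".getBytes("UTF-8"))
    out.close()   // flushes the final partial block and finalizes the file
    fs.close()
  }
}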
3. DataNodes are where the data is stored.
The Secondary NameNode periodically reads the EditLog, applies it to the FsImage, stores the updated FsImage on disk and then truncates the EditLog. The EditLog is also updated periodically on the secondary name node. In case of failure of the primary name node, the secondary name node can quickly bring the FsImage from disk into memory. This is faster than creating the FsImage from scratch using the EditLog: only the changes recorded in the EditLog after the last FsImage update need to be replayed, which takes far less time than rebuilding the FsImage from the full EditLog.
Apache Spark
Apache Spark is a unified analytics engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing. It is an open-source, distributed, general-purpose cluster-computing framework. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. It can be 10 to 100 times faster than Hadoop MapReduce.
Client mode:
The driver runs on a dedicated server (master node) inside a dedicated process. This means it has all available resources at its disposal to execute work.
The driver opens up a dedicated Netty HTTP server and distributes the specified JAR files to all worker nodes (a big advantage).
Because the master node has dedicated resources of its own, you don't need to "spend" worker resources for the driver program.
If the driver process dies, you need an external monitoring system to restart it.
Cluster mode:
The driver runs on one of the cluster's worker nodes. The worker is chosen by the master leader.
The driver runs as a dedicated, standalone process inside the worker.
The driver program takes up at least 1 core and a dedicated amount of memory from one of the workers (this can be configured).
The driver program can be monitored from the master node using the --supervise flag and restarted in case it dies.
When working in cluster mode, all JARs related to the execution of your application need to be available to all the workers. This means you can either manually place them in a shared location or in a folder on each of the workers. A programmatic sketch of choosing the deploy mode follows below.
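As a hedged sketch (not the only way to submit an application), the same choice between client and cluster deploy mode can be made programmatically with Spark's SparkLauncher; the JAR path, main class and master URL below are hypothetical placeholders:

import org.apache.spark.launcher.SparkLauncher

object SubmitSketch {
  def main(args: Array[String]): Unit = {
    val launcher = new SparkLauncher()
      .setAppResource("/path/to/my-app.jar")   // in cluster mode this must be reachable by the workers
      .setMainClass("com.example.MyApp")
      .setMaster("spark://master-host:7077")   // hypothetical standalone master URL
      .setDeployMode("cluster")                // or "client" to run the driver on the submitting machine
      .setConf("spark.driver.memory", "2g")

    // Returns a SparkAppHandle that can be polled for the application state.
    val handle = launcher.startApplication()
    println(handle.getState)
  }
}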
YARN is the most widely used cluster manager for Spark. Kubernetes, a container orchestration platform originally developed by Google, is also catching up.
In client mode with YARN as the cluster manager, the driver creates a Spark session on the client machine and a request is sent to the YARN Resource Manager to start a YARN application. The Resource Manager starts an Application Master, which is responsible for launching executors. The Application Master then asks the Resource Manager to allot containers; once containers are allotted, the Application Master starts an executor in each of these containers. These executors can then communicate directly with the driver.
In cluster mode with YARN, we submit our packaged application to the Resource Manager using spark-submit. The Resource Manager starts a YARN Application Master, which also acts as the driver. The rest of the process is the same as in client mode.
In local mode, a single Spark JVM is started, and both the driver and the executors run inside this JVM. There is no cluster at all, which makes it good for learning purposes.
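A minimal local-mode sketch: creating a SparkSession whose driver and executors share one JVM (the application name is arbitrary):

import org.apache.spark.sql.SparkSession

object LocalModeSketch {
  def main(args: Array[String]): Unit = {
    // "local[*]" runs Spark inside this JVM using all available cores;
    // no cluster manager is involved.
    val spark = SparkSession.builder()
      .appName("local-mode-demo")
      .master("local[*]")
      .getOrCreate()

    spark.range(0, 10).show()
    spark.stop()
  }
}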
Spark application process flow
Spark has three major data structures in which it holds data: RDD, Dataset and DataFrame. Under the hood, Datasets and DataFrames are ultimately executed as RDDs, as sketched below.
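A short sketch of the three structures over the same data, assuming a local SparkSession; the KV case class and its fields are hypothetical:

import org.apache.spark.sql.SparkSession

// Hypothetical record type for the typed Dataset view.
case class KV(key: String, value: Int)

object DataStructuresSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("structures").master("local[*]").getOrCreate()
    import spark.implicits._

    // RDD: low-level distributed collection of JVM objects.
    val rdd = spark.sparkContext.parallelize(Seq(KV("a", 1), KV("b", 2)))

    // DataFrame: rows with a named schema (an alias for Dataset[Row]).
    val df = rdd.toDF()

    // Dataset: a typed view over the same data, checked at compile time.
    val ds = df.as[KV]

    ds.show()
    spark.stop()
  }
}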
RDD
Whenever there is a need to move data across partitions, shuffle and sort operations are used. For example, when we need to group data by key so that all values for a key land in one partition, a shuffle and sort is triggered.
http://datastrophic.io/core-concepts-architecture-and-internals-of-apache-spark/
https://www.freecodecamp.org/news/deep-dive-into-spark-internals-and-architecture-f6e32045393b/
Immutable data can be shared safely across various processes and threads.
Immutability allows you to easily recreate the RDD.
You can enhance the computation process by caching a reused RDD (see the small sketch below).
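A small caching sketch, assuming an existing SparkContext sc and a hypothetical input path:

// Keep the computed word RDD in executor memory because it is used twice.
val words = sc.textFile("hdfs:///tmp/input.txt").flatMap(_.split("\\s+"))
words.cache()

val total = words.count()                 // first action materializes and caches the partitions
val distinct = words.distinct().count()   // reuses the cached partitions instead of re-reading the file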
A wide dependency (or wide transformation) has input partitions contributing to many output partitions. You will often hear this referred to as a shuffle, whereby Spark exchanges partitions across the cluster. These transformations compute data that lives on many partitions, meaning there will be data movement between partitions to execute them.
With narrow transformations, Spark automatically performs an operation called pipelining, meaning that if we specify multiple filters on DataFrames, they will all be performed in memory. The same cannot be said for shuffles: when we perform a shuffle, Spark writes the results to disk. Examples of wide transformations: groupByKey(), reduceByKey(), aggregate, join, repartition. A small contrast of the two is sketched below.
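A sketch contrasting narrow and wide transformations on an RDD, assuming an existing SparkContext sc:

val nums = sc.parallelize(1 to 1000000)

// Narrow: each output partition depends on exactly one input partition,
// so Spark pipelines the filter and map in memory within a single stage.
val evensDoubled = nums.filter(_ % 2 == 0).map(_ * 2)

// Wide: reduceByKey needs all values for a key in the same partition,
// so it triggers a shuffle and starts a new stage.
val counts = evensDoubled.map(n => (n % 10, 1)).reduceByKey(_ + _)
counts.collect()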
6. Spark optimization techniques
7. Handle data skew problem in spark
Data skew is a condition in which data is unevenly distributed among partitions in the cluster. A task executed on a skewed partition will take more time than the others. Skew degrades query performance, especially for joins, and can cause a large amount of shuffling of data, which is a very expensive operation. Common solutions to data skew in Spark include key salting, as well as isolated salting, isolated map join and the iterative broadcast technique. A salting sketch follows below.
https://bigdatacraziness.wordpress.com/2018/01/05/oh-my-god-is-my-data-skewed/
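A hedged sketch of key salting to spread a skewed join key across more partitions; the DataFrames facts and dims, the column join_key and the salt range are all hypothetical:

import org.apache.spark.sql.functions._

val saltBuckets = 16

// Add a random salt (0..15) to the large, skewed side.
val saltedFacts = facts.withColumn("salt", (rand() * saltBuckets).cast("int"))

// Explode the small side so every salt value has a matching row.
val saltedDims = dims.withColumn("salt", explode(array((0 until saltBuckets).map(i => lit(i)): _*)))

// Join on (join_key, salt); each hot key is now split across saltBuckets partitions.
val joined = saltedFacts.join(saltedDims, Seq("join_key", "salt"))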
AVRO vs PARQUET
1. AVRO is a row-based storage format whereas PARQUET is a columnar storage format.
2. PARQUET is much better for analytical querying, i.e. reads and queries are much more efficient than writes.
3. AVRO is more mature than PARQUET when it comes to schema evolution. PARQUET only supports schema append, whereas AVRO supports much richer schema evolution, i.e. adding or modifying columns.
4. PARQUET is ideal for querying a subset of columns in a multi-column table. AVRO is ideal for ETL operations where we need to query all the columns. A brief read/write sketch follows below.
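A brief sketch of writing the same DataFrame in both formats and reading a column subset back, assuming an existing SparkSession spark and DataFrame df with hypothetical columns id and amount; the "avro" format requires the spark-avro module on the classpath (e.g. the org.apache.spark:spark-avro package):

// Row-based: good fit for write-heavy pipelines and full-row ETL reads.
df.write.format("avro").save("/tmp/events_avro")

// Columnar: good fit for analytical scans over a few columns.
df.write.parquet("/tmp/events_parquet")

// Reading back only two columns: Parquet can skip the other columns on disk.
val slim = spark.read.parquet("/tmp/events_parquet").select("id", "amount")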
ORC vs PARQUET
groupByKey first shuffles data from one partition to another so that one partition holds all the data for a given key, and only then is the data grouped (the aggregation is performed).
Since Apache Spark RDD is an immutable dataset, each Spark RDD remembers the lineage
of the deterministic operation that was used on fault-tolerant input dataset to create it.
If due to a worker node failure any partition of an RDD is lost, then that partition can be re-
computed from the original fault-tolerant dataset using the lineage of operations.
Assuming that all of the RDD transformations are deterministic, the data in the final
transformed RDD will always be the same irrespective of failures in the Spark cluster.
To achieve fault tolerance for all the generated RDDs, the received data is replicated among multiple Spark executors on worker nodes in the cluster. A small lineage sketch follows below.
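A small sketch of lineage, assuming an existing SparkContext sc: toDebugString prints the chain of deterministic transformations Spark would replay to rebuild a lost partition:

val base = sc.parallelize(1 to 100)
val derived = base.map(_ * 2).filter(_ % 3 == 0)

// Prints the RDD lineage, e.g. MapPartitionsRDD <- MapPartitionsRDD <- ParallelCollectionRDD.
println(derived.toDebugString)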
27. Difference between executor and driver – The driver is the process
where the main method runs. First it converts the user program into tasks and after that it
schedules the tasks on the executors. While Executors are worker nodes' processes in charge of
running individual tasks in a given Spark job. They are launched at the beginning of a Spark
application and typically run for the entire lifetime of an application. Once they have run the
task they send the results to the driver. They also provide in-memory storage for RDDs that are
cached by user programs through Block Manager.
28. What is catalyst optimizer?
What is Catalyst
Spark SQL was designed with an optimizer called Catalyst based on the functional programming
of Scala. Its two main purposes are: first, to add new optimization techniques to solve some
problems with “big data” and second, to allow developers to expand and customize the
functions of the optimizer.
Catalyst components
The main components of the Catalyst optimizer are as follows:
Trees
The main data type in Catalyst is the tree. Each tree is composed of nodes, and each node has a node type and zero or more children. These objects are immutable and can be manipulated with functional transformations, for example:
tree.transform {
  // fold two literal children of a Merge node into a single literal
  case Merge(Literal(c1), Literal(c2)) => Literal(c1 + c2)
}
Using Catalyst in Spark SQL
The Catalyst Optimizer in Spark offers rule-based and cost-based optimization. Rule-based
optimization indicates how to execute the query from a set of defined rules. Meanwhile, cost-
based optimization generates multiple execution plans and compares them to choose the
lowest cost one.
Phases
The four phases of the transformation that Catalyst performs are as follows:
1. Analysis
Spark SQL begins with a relation to be computed, either from an abstract syntax tree (AST)
returned by a SQL parser, or from a DataFrame object constructed using the API. In both cases,
the relation may contain unresolved attribute references or relations: for example, in the SQL
query SELECT col FROM sales, the type of col, or even whether it is a valid column name, is not
known until we look up the table sales. An attribute is called unresolved if we do not know its
type or have not matched it to an input table (or an alias). Spark SQL uses Catalyst rules and a
Catalog object that tracks the tables in all data sources to resolve these attributes.
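A small sketch of the analysis phase resolving attributes against the catalog, assuming an active SparkSession spark; the sales view is hypothetical:

// Register a temporary view so that "sales" exists in the catalog.
spark.range(5).toDF("id").createOrReplaceTempView("sales")

// "id" is resolved against the catalog entry for "sales" during analysis.
spark.sql("SELECT id FROM sales").show()

// A reference to an unknown column would fail in the analysis phase:
// spark.sql("SELECT col FROM sales")   // AnalysisException: cannot resolve 'col'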
2. Logical optimization
In the logical optimization phase, rule-based optimizations (such as constant folding, predicate pushdown and projection pruning) are applied to the logical plan.
3. Physical plan
In the physical plan phase, Spark SQL takes the logical plan and generates one or more physical plans using the physical operators that match the Spark execution engine. The plan to be executed is selected using the cost-based model (a comparison of plan costs).
4. Code generation
Code generation is the final phase of Spark SQL optimization. To run on each machine, it is necessary to generate Java bytecode from the selected physical plan.
[Figure: phases of the query plan in Spark SQL; rounded squares represent Catalyst trees.]
Example
The Catalyst optimizer is enabled by default as of Spark 2.0 and contains optimizations to manipulate datasets. Below is an example of the plan generated for a query on a Dataset using the Scala Spark SQL API:
// Business object
case class Persona(id: String, nombre: String, edad: Int)

// Needed for the implicit .toDS conversion (assumes an active SparkSession `spark`)
import spark.implicits._

// The dataset to query
val peopleDataset = Seq(
  Persona("001", "Bob", 28),
  Persona("002", "Joe", 34)).toDS

// The query to execute
val query = peopleDataset.groupBy("nombre").count().as("total")

// Get the Catalyst optimization plan
query.explain(extended = true)
As a result, the detailed plan for the query is obtained.
https://www.waitingforcode.com/apache-spark-sql/writing-apache-spark-sql-custom-logical-optimization-unsupported-optimization-hints/read
http://blog.madhukaraphatak.com/spark-3-introduction-part-9/
https://medium.com/@gohitvaranasi/how-does-apache-hive-serde-work-behind-the-scenes-a-theoretical-approach-e67636f08a2a
Happiest Minds