Hadoop & Big Data
Hadoop
Hadoop was first developed on the basis of two papers by Google engineers: "The Google File System" (the basis for HDFS) and "MapReduce: Simplified Data Processing on Large Clusters" (the basis for Hadoop MapReduce).
The Apache Hadoop software library is an open source framework that allows for the distributed
processing of large data sets across clusters of computers using simple programming models.
It is designed to scale up from single servers to thousands of machines, each offering local computation
and storage.
Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.
The Hadoop ecosystem contains many tools that use Hadoop as their foundation:
Hive, Pig, Spark, HBase, Sqoop, Kafka, Flume, Oozie, ZooKeeper, etc.
HDFS
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity
hardware. It has many similarities with existing distributed file systems. However, the differences from
other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed
on low-cost hardware. Features of HDFS are:
1. Distributed – it stores a large file (TBs or PBs) not on a single machine but in chunks spread across several machines. This makes it distributed.
2. Scalable – HDFS is deployed on low-cost commodity hardware machines, which can be added easily, with all machines connected through a network. This is called horizontal scaling.
3. Cost-effective – since it uses low-cost commodity hardware rather than specially built server machines, scaling it is cost-effective.
4. Fault-tolerant – HDFS is fault-tolerant; any machine in its network can go down at any time. HDFS has quick fault-detection mechanisms as well as auto recovery.
5. High throughput – HDFS has high throughput, i.e. the number of records processed per unit of time is high. It focuses on throughput rather than latency (the time to get the first record), so HDFS is not a good choice for low-latency requirements.
HDFS Architecture
[Figure: sample Hadoop cluster layout – a core switch connecting rack switches, each rack being a group of machines]
Above is a sample picture of a Hadoop cluster. Each Hadoop cluster has a Hadoop client through which users interact with the name node and data nodes.
1. The Hadoop client interacts with the cluster through a core switch, which is connected to several rack switches in the network. A rack can be thought of as a collection of computers connected together. A Hadoop cluster can have multiple racks connected via the network.
2. Hadoop follows a master-slave architecture in which one node or computer is assigned as the master (name node), which manages the File System Namespace and controls access to files by clients.
Functions of NameNode:
It is the master daemon that maintains and manages the DataNodes (slave nodes).
It records the metadata of all the files stored in the cluster, e.g. the location of stored blocks, the size of the files, permissions, hierarchy, etc. There are two files associated with the metadata:
o FsImage: It contains the complete state of the file system namespace since the start of
the NameNode.
o EditLogs: It contains all the recent modifications made to the file system with respect to
the most recent FsImage.
It records each change that takes place to the file system metadata. For example, if a file is
deleted in HDFS, the NameNode will immediately record this in the EditLog.
It regularly receives a Heartbeat and a block report from all the DataNodes in the cluster to
ensure that the DataNodes are live.
It keeps a record of all the blocks in HDFS and in which nodes these blocks are located.
HDFS writes data in blocks of 128 MB using FSDataOutputStream. The client first buffers data locally, and when the buffer reaches the block size it sends a block-allocation request to the name node. The name node checks its metadata about the data nodes and allocates a block on one of the available data nodes. A minimal sketch of this write path through the HDFS Java API is shown below.
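A minimal sketch of writing a file through the HDFS FileSystem API from Scala, assuming a reachable NameNode configured via fs.defaultFS; the path, file contents and NameNode address are hypothetical:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FSDataOutputStream, FileSystem, Path}

object HdfsWriteSketch {
  def main(args: Array[String]): Unit = {
    // Assumes fs.defaultFS in core-site.xml points at the NameNode,
    // e.g. hdfs://namenode-host:8020 (hypothetical address).
    val conf = new Configuration()
    val fs: FileSystem = FileSystem.get(conf)

    // create() asks the NameNode for block allocation; the returned
    // FSDataOutputStream buffers data locally and ships full blocks
    // (128 MB by default, dfs.blocksize) to the chosen DataNodes.
    val out: FSDataOutputStream = fs.create(new Path("/tmp/example.txt"))
    out.write("hello hdfs\n".getBytes("UTF-8"))
    out.close()   // flushes the final partial block and finalizes the file
    fs.close()
  }
}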
3. DataNodes are where the data is stored.
The Secondary NameNode periodically reads the EditLog, applies it to the FsImage, stores the updated FsImage on disk and then truncates the EditLog. The EditLog is also updated periodically on the secondary name node. In case of failure of the primary name node, the secondary name node can quickly bring the FsImage from disk into memory. This is faster than creating the FsImage from scratch using the EditLog: only the changes recorded in the EditLog after the last FsImage update need to be replayed, which takes far less time than rebuilding the FsImage from the full EditLog.
Apache Spark
Apache Spark is a unified analytics engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing. It is an open-source, distributed, general-purpose cluster-computing framework. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. It can be 10 to 100 times faster than Hadoop MapReduce.
Client mode:
The driver runs on a dedicated server (master node) inside a dedicated process. This means it has all available resources at its disposal to execute work.
The driver opens up a dedicated Netty HTTP server and distributes the specified JAR files to all worker nodes (a big advantage).
Because the master node has dedicated resources of its own, you don't need to "spend" worker resources for the driver program.
If the driver process dies, you need an external monitoring system to restart it.
Cluster mode:
The driver runs on one of the cluster's worker nodes. The worker is chosen by the master leader.
The driver runs as a dedicated, standalone process inside the worker.
The driver program takes up at least 1 core and a dedicated amount of memory from one of the workers (this can be configured).
The driver program can be monitored from the master node using the --supervise flag and restarted in case it dies.
When working in cluster mode, all JARs related to the execution of your application need to be available to all the workers. This means you can either manually place them in a shared location or in a folder on each of the workers. A programmatic sketch of choosing the deploy mode follows below.
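As a hedged sketch (not the only way to submit an application), the same choice between client and cluster deploy mode can be made programmatically with Spark's SparkLauncher; the JAR path, main class and master URL below are hypothetical placeholders:

import org.apache.spark.launcher.SparkLauncher

object SubmitSketch {
  def main(args: Array[String]): Unit = {
    val launcher = new SparkLauncher()
      .setAppResource("/path/to/my-app.jar")   // in cluster mode this must be reachable by the workers
      .setMainClass("com.example.MyApp")
      .setMaster("spark://master-host:7077")   // hypothetical standalone master URL
      .setDeployMode("cluster")                // or "client" to run the driver on the submitting machine
      .setConf("spark.driver.memory", "2g")

    // Returns a SparkAppHandle that can be polled for the application state.
    val handle = launcher.startApplication()
    println(handle.getState)
  }
}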
YARN is the most widely used cluster manager for Spark. Kubernetes, a container orchestration platform originally developed by Google, is also catching up.
In client mode with YARN as the cluster manager, the driver creates a Spark session on the client machine and a request is sent to the YARN Resource Manager to start a YARN application. The Resource Manager starts an Application Master, which is responsible for launching executors. The Application Master then asks the Resource Manager to allot containers; once containers are allotted, the Application Master starts an executor in each of these containers. These executors can then communicate directly with the driver.
In cluster mode with YARN, we submit our packaged application to the Resource Manager using spark-submit. The Resource Manager starts a YARN Application Master, which also acts as the driver. The rest of the process is the same as in client mode.
In local mode, a single Spark JVM is started, and both the driver and the executors run inside this JVM. There is no cluster at all, which makes it good for learning purposes.
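A minimal local-mode sketch: creating a SparkSession whose driver and executors share one JVM (the application name is arbitrary):

import org.apache.spark.sql.SparkSession

object LocalModeSketch {
  def main(args: Array[String]): Unit = {
    // "local[*]" runs Spark inside this JVM using all available cores;
    // no cluster manager is involved.
    val spark = SparkSession.builder()
      .appName("local-mode-demo")
      .master("local[*]")
      .getOrCreate()

    spark.range(0, 10).show()
    spark.stop()
  }
}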
Spark application process flow
Spark has three major data structures in which it holds data: RDD, Dataset and DataFrame. Under the hood, Datasets and DataFrames are ultimately executed as RDDs, as sketched below.
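A short sketch of the three structures over the same data, assuming a local SparkSession; the KV case class and its fields are hypothetical:

import org.apache.spark.sql.SparkSession

// Hypothetical record type for the typed Dataset view.
case class KV(key: String, value: Int)

object DataStructuresSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("structures").master("local[*]").getOrCreate()
    import spark.implicits._

    // RDD: low-level distributed collection of JVM objects.
    val rdd = spark.sparkContext.parallelize(Seq(KV("a", 1), KV("b", 2)))

    // DataFrame: rows with a named schema (an alias for Dataset[Row]).
    val df = rdd.toDF()

    // Dataset: a typed view over the same data, checked at compile time.
    val ds = df.as[KV]

    ds.show()
    spark.stop()
  }
}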
RDD
Whenever there is a need to move data across partitions, shuffle and sort operations are used. For example, when we need to group data by key so that all values for a key land in one partition, a shuffle and sort is triggered.
http://datastrophic.io/core-concepts-architecture-and-internals-of-apache-spark/
https://www.freecodecamp.org/news/deep-dive-into-spark-internals-and-architecture-f6e32045393b/
Immutable data can be shared safely across various processes and threads.
Immutability allows you to easily recreate the RDD.
You can enhance the computation process by caching a reused RDD (see the small sketch below).
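A small caching sketch, assuming an existing SparkContext sc and a hypothetical input path:

// Keep the computed word RDD in executor memory because it is used twice.
val words = sc.textFile("hdfs:///tmp/input.txt").flatMap(_.split("\\s+"))
words.cache()

val total = words.count()                 // first action materializes and caches the partitions
val distinct = words.distinct().count()   // reuses the cached partitions instead of re-reading the file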
A wide dependency (or wide transformation) has input partitions contributing to many output partitions. You will often hear this referred to as a shuffle, whereby Spark exchanges partitions across the cluster. These transformations compute data that lives on many partitions, meaning there will be data movement between partitions to execute them.
With narrow transformations, Spark automatically performs an operation called pipelining, meaning that if we specify multiple filters on DataFrames, they will all be performed in memory. The same cannot be said for shuffles: when we perform a shuffle, Spark writes the results to disk. Examples of wide transformations: groupByKey(), reduceByKey(), aggregate, join, repartition. A small contrast of the two is sketched below.
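A sketch contrasting narrow and wide transformations on an RDD, assuming an existing SparkContext sc:

val nums = sc.parallelize(1 to 1000000)

// Narrow: each output partition depends on exactly one input partition,
// so Spark pipelines the filter and map in memory within a single stage.
val evensDoubled = nums.filter(_ % 2 == 0).map(_ * 2)

// Wide: reduceByKey needs all values for a key in the same partition,
// so it triggers a shuffle and starts a new stage.
val counts = evensDoubled.map(n => (n % 10, 1)).reduceByKey(_ + _)
counts.collect()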
6. Spark optimization techniques
7. Handle data skew problem in spark
Data skew is a condition in which data is unevenly distributed among partitions in the cluster. A task executed on a skewed partition will take more time than the others. Skew degrades query performance, especially for joins, and can cause a large amount of shuffling of data, which is a very expensive operation. Common solutions to data skew in Spark include key salting, as well as isolated salting, isolated map join and the iterative broadcast technique. A salting sketch follows below.
https://bigdatacraziness.wordpress.com/2018/01/05/oh-my-god-is-my-data-skewed/
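A hedged sketch of key salting to spread a skewed join key across more partitions; the DataFrames facts and dims, the column join_key and the salt range are all hypothetical:

import org.apache.spark.sql.functions._

val saltBuckets = 16

// Add a random salt (0..15) to the large, skewed side.
val saltedFacts = facts.withColumn("salt", (rand() * saltBuckets).cast("int"))

// Explode the small side so every salt value has a matching row.
val saltedDims = dims.withColumn("salt", explode(array((0 until saltBuckets).map(i => lit(i)): _*)))

// Join on (join_key, salt); each hot key is now split across saltBuckets partitions.
val joined = saltedFacts.join(saltedDims, Seq("join_key", "salt"))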
AVRO vs PARQUET
1. AVRO is a row-based storage format whereas PARQUET is a columnar storage format.
2. PARQUET is much better for analytical querying, i.e. reads and queries are much more efficient than writes.
3. AVRO is more mature than PARQUET when it comes to schema evolution. PARQUET only supports schema append, whereas AVRO supports much richer schema evolution, i.e. adding or modifying columns.
4. PARQUET is ideal for querying a subset of columns in a multi-column table. AVRO is ideal for ETL operations where we need to query all the columns. A brief read/write sketch follows below.
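A brief sketch of writing the same DataFrame in both formats and reading a column subset back, assuming an existing SparkSession spark and DataFrame df with hypothetical columns id and amount; the "avro" format requires the spark-avro module on the classpath (e.g. the org.apache.spark:spark-avro package):

// Row-based: good fit for write-heavy pipelines and full-row ETL reads.
df.write.format("avro").save("/tmp/events_avro")

// Columnar: good fit for analytical scans over a few columns.
df.write.parquet("/tmp/events_parquet")

// Reading back only two columns: Parquet can skip the other columns on disk.
val slim = spark.read.parquet("/tmp/events_parquet").select("id", "amount")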
ORC vs PARQUET
groupByKey first shuffles data from one partition to another so that one partition holds all the data for a given key, and only then is the data grouped (the aggregation is performed).
Since Apache Spark RDD is an immutable dataset, each Spark RDD remembers the lineage
of the deterministic operation that was used on fault-tolerant input dataset to create it.
If due to a worker node failure any partition of an RDD is lost, then that partition can be re-
computed from the original fault-tolerant dataset using the lineage of operations.
Assuming that all of the RDD transformations are deterministic, the data in the final
transformed RDD will always be the same irrespective of failures in the Spark cluster.
To achieve fault tolerance for all the generated RDDs, the received data is replicated among multiple Spark executors on worker nodes in the cluster. A small lineage sketch follows below.
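A small sketch of lineage, assuming an existing SparkContext sc: toDebugString prints the chain of deterministic transformations Spark would replay to rebuild a lost partition:

val base = sc.parallelize(1 to 100)
val derived = base.map(_ * 2).filter(_ % 3 == 0)

// Prints the RDD lineage, e.g. MapPartitionsRDD <- MapPartitionsRDD <- ParallelCollectionRDD.
println(derived.toDebugString)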
27. Difference between executor and driver – The driver is the process
where the main method runs. First it converts the user program into tasks and after that it
schedules the tasks on the executors. While Executors are worker nodes' processes in charge of
running individual tasks in a given Spark job. They are launched at the beginning of a Spark
application and typically run for the entire lifetime of an application. Once they have run the
task they send the results to the driver. They also provide in-memory storage for RDDs that are
cached by user programs through Block Manager.
28. What is catalyst optimizer?
What is Catalyst
Spark SQL was designed with an optimizer called Catalyst based on the functional programming
of Scala. Its two main purposes are: first, to add new optimization techniques to solve some
problems with “big data” and second, to allow developers to expand and customize the
functions of the optimizer.
Catalyst components
The main components of the Catalyst optimizer are as follows:
Trees
The main data type in Catalyst is the tree. Each tree is composed of nodes, and each node has a node type and zero or more children. These objects are immutable and can be manipulated with functional transformations, for example:
tree.transform {
  // fold two literal children of a Merge node into a single literal
  case Merge(Literal(c1), Literal(c2)) => Literal(c1 + c2)
}
Using Catalyst in Spark SQL
The Catalyst Optimizer in Spark offers rule-based and cost-based optimization. Rule-based
optimization indicates how to execute the query from a set of defined rules. Meanwhile, cost-
based optimization generates multiple execution plans and compares them to choose the
lowest cost one.
Phases
The four phases of the transformation that Catalyst performs are as follows:
1. Analysis
Spark SQL begins with a relation to be computed, either from an abstract syntax tree (AST)
returned by a SQL parser, or from a DataFrame object constructed using the API. In both cases,
the relation may contain unresolved attribute references or relations: for example, in the SQL
query SELECT col FROM sales, the type of col, or even whether it is a valid column name, is not
known until we look up the table sales. An attribute is called unresolved if we do not know its
type or have not matched it to an input table (or an alias). Spark SQL uses Catalyst rules and a
Catalog object that tracks the tables in all data sources to resolve these attributes.
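A small sketch of the analysis phase resolving attributes against the catalog, assuming an active SparkSession spark; the sales view is hypothetical:

// Register a temporary view so that "sales" exists in the catalog.
spark.range(5).toDF("id").createOrReplaceTempView("sales")

// "id" is resolved against the catalog entry for "sales" during analysis.
spark.sql("SELECT id FROM sales").show()

// A reference to an unknown column would fail in the analysis phase:
// spark.sql("SELECT col FROM sales")   // AnalysisException: cannot resolve 'col'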
2. Logical optimization
In the logical optimization phase, rule-based optimizations (such as constant folding, predicate pushdown and projection pruning) are applied to the logical plan.
3. Physical plan
In the physical plan phase, Spark SQL takes the logical plan and generates one or more physical plans using the physical operators that match the Spark execution engine. The plan to be executed is selected using the cost-based model (a comparison of plan costs).
4. Code generation
Code generation is the final phase of Spark SQL optimization. To run on each machine, it is necessary to generate Java bytecode from the selected physical plan.
[Figure: phases of the query plan in Spark SQL; rounded squares represent Catalyst trees.]
Example
The Catalyst optimizer is enabled by default as of Spark 2.0 and contains optimizations to manipulate datasets. Below is an example of the plan generated for a query on a Dataset using the Scala Spark SQL API:
// Business object
case class Persona(id: String, nombre: String, edad: Int)

// Needed for the implicit .toDS conversion (assumes an active SparkSession `spark`)
import spark.implicits._

// The dataset to query
val peopleDataset = Seq(
  Persona("001", "Bob", 28),
  Persona("002", "Joe", 34)).toDS

// The query to execute
val query = peopleDataset.groupBy("nombre").count().as("total")

// Get the Catalyst optimization plan
query.explain(extended = true)
As a result, the detailed plan for the query is obtained.
https://www.waitingforcode.com/apache-spark-sql/writing-apache-spark-sql-custom-logical-optimization-unsupported-optimization-hints/read
http://blog.madhukaraphatak.com/spark-3-introduction-part-9/
https://medium.com/@gohitvaranasi/how-does-apache-hive-serde-work-behind-the-scenes-a-theoretical-approach-e67636f08a2a
Happiest Minds