Exploring Hadoop Ecosystem (Volume 1): Batch Processing
Ebook, 492 pages, 3 hours

About this ebook

The Hadoop ecosystem consists of many components, which can be a headache for people who want to learn or understand them. This book helps data engineers and architects understand the internals of big data technologies, from the basics of HDFS and MapReduce to Kafka, Spark, and others. There are currently two volumes: Volume 1 mainly covers batch processing, and Volume 2 mainly covers stream processing.
Language: English
Publisher: Lulu.com
Release date: Mar 31, 2021
ISBN: 9781667186184

Author

Wei Liu

Wei Liu is a Doctor of Engineering at Beijing University of Aeronautics and Astronautics, a professor at Beijing University of Posts and Telecommunications, a visiting scholar at Cambridge University, an expert in the Artificial Intelligence Group of the Center for Strategy and Security at Tsinghua University, and vice chairman of the cognitive branch of the China Association of Command and Control. His research interests include human-computer integration intelligence, cognitive engineering, human-machine-environment system engineering, future situation awareness modes, and behavior analysis and prediction technology. So far, he has published more than 70 papers, 4 monographs, and 2 translations. At present, he is a distinguished expert of the Expert Committee of the China Information and Electronic Engineering Science and Technology Development Center, an appraisal expert for the National Natural Science Foundation of China, a member of the National Ergonomics Standardization Technical Committee, and a senior member of the Chinese Artificial Intelligence Society.

    Book preview

    Exploring Hadoop Ecosystem (Volume 1) - Wei Liu

    Infrastructure

    Hadoop Ecosystem

    Generally, when we refer to Hadoop, we mean an extensive software package, also called the Hadoop ecosystem. It contains the core components (the core Hadoop framework) as well as various extensions that add functions to the core framework for processing large amounts of data. The Hadoop ecosystem includes both Apache open-source projects and a wide variety of commercial tools and solutions. Each Hadoop ecosystem component has its own developer community and release cycle.

    Core Framework

    The core Hadoop framework constitutes the basis of the Hadoop ecosystem. The framework itself is mostly written in Java, with some native code in C and command-line utilities written as shell scripts. It is composed of the following modules:

    [Figure: modules of the core Hadoop framework]

    The Hadoop 2.x architecture introduced one new component, YARN, which was a game changer. In Hadoop 1.x, MapReduce handled both application management and resource management. In Hadoop 2.x, MapReduce handles application management while YARN manages resources.

    Related projects

    In addition to the core components, the Hadoop ecosystem encompasses a wide range of extensions. The following projects are covered in this book.

    Hive

    Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL.

    Tez

    Apache Tez is aimed at building an application framework that allows a complex DAG of tasks for processing data.

    HBase

    Apache HBase is the Hadoop database, a distributed, scalable, big data store.

    ZooKeeper

    Apache ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.

    Oozie

    Apache Oozie is a workflow scheduler system to manage Apache Hadoop jobs.

    Sqoop

    Apache Sqoop is a tool designed for efficiently transferring bulk data between Hadoop and structured datastores such as relational databases.

    Flume

    Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.

    Ambari

    Apache Ambari is aimed at making Hadoop management simpler. Ambari provides an intuitive, easy-to-use Hadoop management web UI and RESTful APIs.

    Spark

    Apache Spark is a unified analytics engine for large-scale data processing.

    Kafka

    Apache Kafka is a distributed event streaming platform.

    Hadoop Distributions

    Hadoop is an open-source project under the Apache Software Foundation, and most components in the Hadoop ecosystem are also open source. Several Hadoop vendors have stepped in to develop their own distributions on top of the Hadoop framework to make it enterprise ready. They have added new functionality by improving the code base and bundling it with user-friendly management tools, technical support, and continuous updates.

    There are many distributions available in the market. These distributions pull together the enhancement projects present in the Apache repository and present them as a unified product, so that organizations don't have to spend time assembling these elements into a single functional component. The most recognized Hadoop distributions are Cloudera, Hortonworks, and MapR. All three use the core Hadoop framework and bundle it for enterprise use. The features these vendors offer as part of their core distribution include support services and a subscription service model.

    Cloudera

    Cloudera was founded in 2008 and offers an open-source Hadoop distribution, the oldest one available. People at Cloudera are committed to contributing to the open-source community, and they have contributed to the building of Hive, Impala, Hadoop, Pig, and other popular open-source projects. Cloudera packages a good set of tools together to provide a good Hadoop experience. It also provides a GUI for managing and monitoring clusters, known as Cloudera Manager.

    Hortonworks

    Hortonworks was founded in 2011 and provides the Hortonworks Data Platform (HDP), an open-source Hadoop distribution. HDP is widely used in organizations and provides an Apache Ambari GUI for managing and monitoring clusters. Hortonworks contributes to many open-source projects such as Apache Tez, Hadoop, YARN, and Hive. Hortonworks has also launched the Hortonworks DataFlow (HDF) platform for data ingestion and storage. The Hortonworks distribution also focuses on the security aspect of Hadoop and has integrated Ranger, Kerberos, and SSL-like security with the HDP and HDF platforms.

    MapR

    MapR was founded in 2009 and has its own filesystem called MapR-FS, which is quite similar to HDFS but with some new features built by MapR. It boasts higher performance, includes a set of tools to manage and administer a cluster, and does not suffer from a single point of failure. It also offers useful features such as mirroring and snapshots.

    Cloudera and Hortonworks merged in 2019 and the new company uses the Cloudera brand.

    There are a few popular distributions available for the cloud.

    Microsoft Azure

    Microsoft offers HDInsight as a Hadoop distribution. It also offers a cost-effective solution for Hadoop infrastructure setup, monitoring, and cluster resource management. Azure claims to provide a fully cloud-based cluster backed by a 99.9% Service Level Agreement (SLA).

    Amazon

    Amazon provides Elastic MapReduce and many other Hadoop ecosystem tools in its distribution. It offers the S3 file system as an alternative to HDFS. Amazon offers a cost-effective setup for Hadoop in the cloud, and it is currently the most actively used cloud Hadoop distribution.

    All the examples in this book use HDP 3.1.4.

    Google File System is a proprietary distributed file system developed by Google to provide efficient, reliable access to data using large clusters of commodity hardware. HDFS is an open source Java product similar to GFS, based on the paper Google published about their Google File System in 2003.

    HDFS, short for Hadoop Distributed File System, is a distributed file system that handles large data sets running on commodity hardware. It is used to scale a single Apache Hadoop cluster to hundreds and even thousands of nodes. HDFS is one of the major components of Apache Hadoop, the others being MapReduce and YARN.

    Hadoop FileSystem APIs

    Hadoop has an abstract notion of a filesystem. The specification of the Hadoop FileSystem APIs can be found here: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/index.html.

    org.apache.hadoop.fs.FileSystem is an abstract base class for a fairly generic filesystem. It may be implemented as a distributed filesystem, or as a local system. The local implementation is LocalFileSystem and distributed implementation is DistributedFileSystem. The local implementation exists for small Hadoop instances and for testing. HDFS is a distributed file system that comes with Hadoop that implements this file interface.

    Hadoop Compatible File Systems

    There are other implementations for object stores and third-party filesystems. All such filesystems must implement the org.apache.hadoop.fs.FileSystem class, which ensures that there is a common API that applications such as MapReduce, Apache HBase, Apache Giraph, and others can use. We refer to these filesystems as Hadoop Compatible File Systems. Examples include Azure Data Lake Storage, Azure Blob, Amazon S3, Aliyun OSS, and OpenStack Swift.
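
    To illustrate why a shared base class matters, the following toy Python sketch mirrors the idea rather than Hadoop's actual Java API. The class and method names (FileSystem, LocalFileSystem, InMemoryObjectStore, list_status, count_entries) are assumptions made for illustration; only the design, one abstract contract with interchangeable local and object-store implementations, comes from the text.

```python
import os
from abc import ABC, abstractmethod


class FileSystem(ABC):
    """Toy stand-in for an abstract filesystem contract (not Hadoop's real API)."""

    @abstractmethod
    def exists(self, path):
        """Return True if the path exists."""

    @abstractmethod
    def list_status(self, path):
        """Return the sorted names directly under the path."""


class LocalFileSystem(FileSystem):
    """Backed by the local OS filesystem."""

    def exists(self, path):
        return os.path.exists(path)

    def list_status(self, path):
        return sorted(os.listdir(path))


class InMemoryObjectStore(FileSystem):
    """Hypothetical object store: flat keys, directories are only key prefixes."""

    def __init__(self, keys):
        self._keys = set(keys)

    def exists(self, path):
        key = path.strip("/")
        return key in self._keys or any(k.startswith(key + "/") for k in self._keys)

    def list_status(self, path):
        stripped = path.strip("/")
        prefix = stripped + "/" if stripped else ""
        return sorted({k[len(prefix):].split("/")[0]
                       for k in self._keys if k.startswith(prefix)})


def count_entries(fs, path):
    """Application code that depends only on the abstract contract."""
    return len(fs.list_status(path))
```

    Because count_entries only touches the abstract methods, it works unchanged against either implementation, which is the property the FileSystem class gives to applications such as MapReduce.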

    Example of Microsoft Azure

    The hadoop-azure module provides support for integration with Azure Blob and Azure Data Lake Storage Gen2. The built jar file, named hadoop-azure.jar, includes two drivers. Windows Azure Storage Blob driver or WASB driver provides support for Azure Blob. Azure Blob File System driver or ABFS driver provides support for Azure Data Lake Storage Gen2.

    The hadoop-azure-datalake module provides support for integration with the Azure Data Lake Storage Gen1. The JAR file azure-datalake-store.jar includes one driver: Azure Data Lake driver.

    The WASB, ABFS, and ADL drivers, built on top of the HDFS APIs, are all part of Apache Hadoop and are included in many of the commercial distributions of Hadoop. We can mount Azure Storage manually to a Hadoop cluster that lives anywhere, as long as it has Internet access to Azure Storage.

    URI syntax

    For the Azure Standard blob the URI is: wasb[s]://container@account_name.blob.core.windows.net///

    For ADLS Gen2 the URI is: abfs[s]://file_system@account_name.dfs.core.windows.net///

    For ADLS Gen1 the URI is: adl://.azuredatalakestore.net//

    HDInsight

    These drivers are also all built into HDInsight. During HDInsight cluster creation, we can specify a blob container in Azure Blob, or Azure Data Lake Storage Gen2 or Gen1, as the default file system.

    When migrating big data workloads to the cloud, one of the most commonly asked questions is how to evaluate HDFS versus the storage systems provided by cloud providers, such as Amazon S3, Microsoft Azure Blob, Microsoft Azure Data Lake Storage.

    HDFS Architecture

    [Figure: HDFS architecture]

    HDFS has a master-slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the filesystem namespace and regulates access to files by clients. In addition, there are a number of DataNodes, which manage storage attached to the nodes that they run on.

    Components of HDFS Architecture

    NameNode (NN)

    Master daemon.

    Maintains and manages DataNodes (DNs).

    Records metadata.

    Receives Heartbeat and Blockreport from DNs.

    Secondary NameNode

    It is not a NN backup.

    It merges the fsimage and the edits files periodically and keeps the edits size within a limit. It usually runs on a different machine than the NN.

    DataNode (DN)

    Slave daemons.

    Stores actual blocks and block metadata (length, checksum, timestamp, etc.). Although DNs do not contain metadata about the directories and files stored in an HDFS cluster, they do contain a small amount of metadata about the DN itself and its relationship to a cluster.

    Serves read and write requests from the HDFS Client.

    HDFS Client

    HDFS currently provides three client interfaces: DistributedFileSystem, FsShell, and DFSAdmin.

    DistributedFileSystem provides APIs for users to develop HDFS-based applications.

    FsShell allows users to perform common file system operations such as create and delete through the HDFS shell commands hdfs dfs or hadoop fs.

    DFSAdmin provides system administrators with the management shell command hdfs dfsadmin, for tasks such as performing upgrades, managing safe mode, and more.

    DistributedFileSystem, FsShell, and DFSAdmin all manage and manipulate HDFS by directly or indirectly holding a reference to DFSClient and calling the interface it provides. DFSClient encapsulates the complicated interaction logic and exposes a simple interface to the outside.

    [Figure: class diagram of the HDFS client interfaces]

    HDFS Communication Protocol

    All HDFS communication protocols are layered on top of the TCP/IP protocol.

    The HDFS Client talks to the NN using ClientProtocol.

    The HDFS Client talks to the DNs using ClientDatanodeProtocol.

    The Secondary NN talks to the NN using NamenodeProtocol.

    DNs talk to the NN using the DatanodeProtocol.

    DNs talk to each other using the InterDatanodeProtocol.

    Main structure of NameNode

    namespace

    Namespace is a hierarchy of files and directories.

    Every file, directory, and block in HDFS is represented as an object in the NN's memory. Files and directories are represented on the NN by INodes. We call INodes and Blocks namespace objects. Each namespace object consumes approximately 150 bytes of memory.
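
    The 150-byte figure allows a back-of-the-envelope estimate of NN memory. The sketch below is only that arithmetic. The function name and the sample cluster sizes are hypothetical, and real heap usage depends on the Hadoop version and JVM settings.

```python
# Approximate per-object cost quoted in the text.
BYTES_PER_NAMESPACE_OBJECT = 150


def estimate_namenode_memory(num_files, num_dirs, blocks_per_file):
    """Return the approximate bytes needed to hold the namespace objects."""
    inodes = num_files + num_dirs          # one INode per file or directory
    blocks = num_files * blocks_per_file   # one Block object per block
    return (inodes + blocks) * BYTES_PER_NAMESPACE_OBJECT


if __name__ == "__main__":
    # Hypothetical cluster: 10 million files, 1 million directories,
    # 2 blocks per file on average.
    total = estimate_namenode_memory(10_000_000, 1_000_000, 2)
    print(f"{total / 1024 ** 3:.2f} GiB")
```

    The estimate grows with the number of objects, not with the number of bytes stored, which is why many small files put more pressure on the NN than a few large ones.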

    metadata

    Namespace is nothing but the filesystem tree, while metadata represents the structure of HDFS directories and files in a tree. The NN maintains the filesystem tree and the metadata for all the files and directories in the tree.

    Metadata contains various information related to directories and files like ownership, permissions, quotas, replication factor, mapping of blocks to files, and mapping of blocks to DNs, etc.

    There are two files associated with metadata: fsimage (an image of the file system) and edits (a series of modifications made to the file system).

    When the NN crashes, we can use the metadata to reconstruct the entire file system, so it is important that the metadata is safely persisted to stable storage.

    Class Diagram

    [Figure: class diagram]

    FSNamesystem is a container of both transient and persisted namespace state, and does all the book-keeping work on a NameNode. Both FSDirectory and FSNamesystem manage the state of the namespace. FSDirectory is a pure in-memory data structure, all of whose operations happen entirely in memory. In contrast, FSNamesystem persists the operations to the disk.

    FSImage handles checkpointing and logging of the namespace edits. FSEditLog maintains a log of the namespace modifications.

    FSDirectory is responsible for maintaining the tree structure of the entire file system. However, looking up a block by traversing the tree is inefficient. BlocksMap makes block lookup an O(1) operation. It maintains the mapping of blocks to files and the mapping of blocks to DNs.
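
    The difference between the two lookups can be sketched with a toy model. The Python classes below borrow names from the text but are simplified assumptions, not the Hadoop implementation: the tree walk visits inodes until it finds the block, while the map answers with a single hash lookup.

```python
class INodeFile:
    def __init__(self, name, block_ids):
        self.name = name
        self.block_ids = list(block_ids)


class INodeDirectory:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)


def find_file_by_tree_walk(root, block_id):
    """Search the whole tree for the file owning block_id; cost grows with inode count."""
    stack = [root]
    while stack:
        node = stack.pop()
        if isinstance(node, INodeFile):
            if block_id in node.block_ids:
                return node
        else:
            stack.extend(node.children)
    return None


class BlocksMap:
    """Toy map: block ID -> (owning file, set of DataNode names)."""

    def __init__(self):
        self._blocks = {}

    def add_block(self, block_id, inode_file):
        self._blocks[block_id] = (inode_file, set())

    def add_replica(self, block_id, datanode):
        self._blocks[block_id][1].add(datanode)

    def get_file(self, block_id):
        return self._blocks[block_id][0]       # average O(1) hash lookup

    def get_datanodes(self, block_id):
        return self._blocks[block_id][1]
```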

    fsimage and edits

    When the NN is formatted, it creates a data structure that contains fsimage, edits, and VERSION. The NN uses files in its local host OS file system to store fsimage and edits. We can think of HDFS metadata as consisting of two parts: fsimage and edits.

    fsimage

    An fsimage file contains the complete state of the file system (except the mapping of blocks to DNs) at a point in time. The entire file system namespace, including the mapping of blocks to files and file system properties is stored in fsimage file.

    edits

    An edits file is a log that lists each file system change (file creation, deletion or modification) that was made after the most recent fsimage.

    NameNode Metadata directory

    This screenshot is an example of an HDFS metadata directory taken from a NameNode. In this example, the same directory has been used for both fsimage and edits. Alternative configuration options are available that allow separating fsimage and edits into different directories.

    [Figure: screenshot of a NameNode metadata directory]

    VERSION

    A text file that contains the following elements:

    layoutVersion: Version of the HDFS metadata format. When we add new features that require a change to the metadata format, we change this number.

    namespaceID, clusterID, blockpoolID: Unique identifiers of an HDFS cluster. These identifiers are used to prevent DataNodes from registering accidentally with an incorrect NameNode that is part of a different cluster. These identifiers also are particularly important in a federated deployment. Within a federated deployment, there are multiple NameNodes working independently. Each NameNode serves a unique portion of the namespace (namespaceID) and manages a unique set of blocks (blockpoolID). The clusterID ties the whole cluster together as a single logical unit. It is the same across all nodes in the cluster.

    storageType: Always NAME_NODE for the NameNode, and never JOURNAL_NODE.

    cTime: Creation time of file system state.

    edits_<start transaction ID>-<end transaction ID>

    Finalized and unmodifiable edit log segments. Each of these files contains all of the edit log transactions in the range defined by the file name. In a High Availability deployment, the standby can only read up through the finalized log segments. The standby NameNode is not up to date with the current edit log in progress. When an HA failover happens, the failover finalizes the current log segment so that the standby is completely caught up before switching to active.

    edits_inprogress_<start transaction ID>

    This is the current edit log in progress. All transactions starting from the specified transaction are in this file, and all new incoming transactions will get appended to this file.

    fsimage_<end transaction ID>

    Contains the complete metadata image up through the end transaction ID in the file name. Each fsimage file also has a corresponding .md5 file containing an MD5 checksum, which HDFS uses to guard against disk corruption.

    seen_txid

    Contains the last transaction ID of the last checkpoint (merge of edits into an fsimage) or edit log roll (finalization of current edits_inprogress and creation of a new one). This is not the last transaction ID accepted by the NameNode. The file is not updated on every transaction, only on a checkpoint or an edit log roll. The purpose of this file is to try to identify if edits are missing during startup. It is possible to configure the NameNode to use separate directories for fsimage and edits files. If the edits directory accidentally gets deleted, then all transactions since the last checkpoint would go away, and the NameNode starts up using just fsimage at an old state. To guard against this, NameNode startup also checks seen_txid to verify that it can load transactions at least up through that number. It aborts startup if it cannot verify the load transactions.
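
    The naming patterns above can be recognized mechanically. The following Python sketch classifies file names from a metadata directory. The function name and the sample names used with it are hypothetical, and the regular expressions assume that transaction IDs appear as plain decimal digits.

```python
import re

# Patterns follow the file-name conventions described in the text.
EDITS_FINALIZED = re.compile(r"^edits_(\d+)-(\d+)$")
EDITS_INPROGRESS = re.compile(r"^edits_inprogress_(\d+)$")
FSIMAGE = re.compile(r"^fsimage_(\d+)$")


def classify(name):
    """Return (kind, start_txid, end_txid); unknown parts are None."""
    m = EDITS_FINALIZED.match(name)
    if m:
        return ("edits", int(m.group(1)), int(m.group(2)))
    m = EDITS_INPROGRESS.match(name)
    if m:
        return ("edits_inprogress", int(m.group(1)), None)
    m = FSIMAGE.match(name)
    if m:
        return ("fsimage", None, int(m.group(1)))
    return ("other", None, None)   # VERSION, seen_txid, .md5 files, and so on


if __name__ == "__main__":
    # Hypothetical directory listing.
    for name in ["edits_0000000000000000001-0000000000000000019",
                 "edits_inprogress_0000000000000000020",
                 "fsimage_0000000000000000019",
                 "fsimage_0000000000000000019.md5",
                 "seen_txid"]:
        print(name, classify(name))
```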

    When restarting, the NN re-establishes the complete namespace from fsimage, edits, and Blockreports. After the NN loads fsimage and replays edits, the mapping of blocks to DNs is still missing. It has to be collected dynamically from the DNs' Blockreports.

    Checkpointing

    Checkpointing is a process that takes an fsimage and edits and compacts them into a new fsimage.

    NN maintains namespace in HDFS. All metadata is stored in fsimage and edits at NN. Secondary NN does checkpointing for NN.

    Because checkpointing merges edits into fsimage, the process is resource intensive and can impact ongoing requests at the NameNode.

    When the NN starts up, or a checkpoint is triggered by a configurable threshold, it reads the fsimage and edits from disk, applies all the transactions from the edits to the in-memory representation of the fsimage, and flushes out this new version into a new fsimage on disk. It can then truncate the old edits because its transactions have been applied to the persistent fsimage. 
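
    The merge described above can be sketched as replaying a log onto a snapshot. The Python model below is a deliberate simplification under stated assumptions: the image is a dictionary, each edit is a (transaction ID, operation, path) tuple, and only create and delete operations exist. None of these structures reflect the real on-disk formats.

```python
def checkpoint(fsimage, edits):
    """Apply edits newer than the image and return a new image; inputs are not modified."""
    namespace = dict(fsimage["namespace"])   # in-memory copy of the old image
    last_txid = fsimage["last_txid"]
    for txid, op, path in sorted(edits):
        if txid <= last_txid:
            continue                         # already reflected in the image
        if op == "create":
            namespace[path] = {"txid": txid}
        elif op == "delete":
            namespace.pop(path, None)
        else:
            raise ValueError(f"unknown operation: {op}")
        last_txid = txid
    return {"last_txid": last_txid, "namespace": namespace}


if __name__ == "__main__":
    old_image = {"last_txid": 2, "namespace": {"/a": {"txid": 1}, "/b": {"txid": 2}}}
    log = [(2, "create", "/b"), (3, "create", "/c"), (4, "delete", "/a")]
    new_image = checkpoint(old_image, log)
    print(new_image["last_txid"], sorted(new_image["namespace"]))
```

    Once the new image records the last applied transaction ID, every edit at or below that ID is redundant, which is why the old edits can be truncated.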

    A checkpoint can be triggered at a given time interval, expressed in seconds (dfs.namenode.checkpoint.period), or after a given number of filesystem transactions have accumulated (dfs.namenode.checkpoint.txns). The dfs.namenode.checkpoint.check.period property sets how often the transaction count is polled. If both thresholds are set, the first one to be reached triggers a checkpoint.
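
    The "first threshold wins" rule amounts to a logical OR of the two conditions. The sketch below shows this in Python. The function name is hypothetical, and the default values (3600 seconds and 1,000,000 transactions) are the ones I understand hdfs-default.xml to ship with, so verify them against the Hadoop version in use.

```python
# Assumed defaults; check hdfs-default.xml for the deployed version.
CHECKPOINT_PERIOD_S = 3600     # dfs.namenode.checkpoint.period
CHECKPOINT_TXNS = 1_000_000    # dfs.namenode.checkpoint.txns


def should_checkpoint(seconds_since_last, uncheckpointed_txns,
                      period_s=CHECKPOINT_PERIOD_S, txns=CHECKPOINT_TXNS):
    """Return True as soon as either threshold has been reached."""
    return seconds_since_last >= period_s or uncheckpointed_txns >= txns
```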

    Checkpointing Process

    [Figure: checkpointing process]

    The Secondary NN checks whether either of the two preconditions is met.

    When it's time to perform the checkpoint, the NN creates a new file to accept the file system changes. It names the new file edits.new.

    The edits file, along with the fsimage file, is copied to the Secondary NN.

    The Secondary NN merges these two files, creating a file named fsimage.ckpt.

    The Secondary NN copies the fsimage.ckpt file to the NN.

    The NN overwrites the file fsimage with fsimage.ckpt and renames edits.new to edits.

    Heartbeat vs Blockreport

    The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode.

    Heartbeat

    DNs send heartbeats to the NN to confirm that the DataNode is operating and the block replicas it hosts are available.

    The default heartbeat interval is 3 seconds. It is set by the dfs.heartbeat.interval parameter in hdfs-site.xml. If the NN does not receive a heartbeat from a DN for about 10 minutes, the NN considers the DN to be out of service and the block replicas hosted by that DN to be unavailable. The NN then schedules the creation of new replicas of those blocks on other DNs.
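
    The timeout of roughly 10 minutes is derived from two settings rather than configured directly. As far as I know, the NN computes it as twice dfs.namenode.heartbeat.recheck-interval plus ten times dfs.heartbeat.interval. The Python sketch below reproduces that arithmetic. The function name is hypothetical, and the 300,000 ms recheck default is an assumption to verify against the deployed version.

```python
def datanode_dead_timeout_s(heartbeat_interval_s=3, recheck_interval_ms=300_000):
    """Seconds of silence after which a DataNode is treated as out of service."""
    return 2 * (recheck_interval_ms / 1000) + 10 * heartbeat_interval_s


if __name__ == "__main__":
    # With the assumed defaults: 2 * 300 s + 10 * 3 s = 630 s (10 min 30 s).
    print(datanode_dead_timeout_s())
```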

    Heartbeats from a DN also carry information about total storage capacity, fraction of storage in use, and the number of data transfers currently in progress. These statistics are used for the NN's block allocation and load balancing decisions.

    The NN does not directly send requests to DNs. It uses replies to heartbeats to send instructions to the DNs. The instructions include commands to replicate blocks to other nodes, remove local block replicas, re-register and send an immediate block report, and shut down the node.

    Blockreport

    When a DN starts up, it scans through its local file system, generates a list of all data blocks that correspond to each of these local files and sends this report to the NN.

    The first block report is sent immediately after the DN registration. Subsequent block reports are sent periodically and provide the NN with an up-to-date view of where block replicas are located on the cluster. Interval of Blockreport is determined by configuration dfs.blockreport.intervalMsec in hdfs-site.xml. By default this is set to 21600000 milliseconds (6 hours).

    Rack Awareness

    In a large cluster, to reduce network traffic when reading or writing HDFS files, the NN serves read/write requests from DNs on the same rack as the client or on a nearby rack. To do so, the NN maintains the rack ID of each DN. This process of choosing nearby DNs based on rack ID is called Rack Awareness.

    By default, the NN has no idea which node is in which rack. It therefore by default assumes that all nodes are in the same rack, which is likely true for small clusters. It calls this rack /default-rack.

    The NN obtains the rack IDs of the cluster nodes by invoking either an external script or a Java class, as specified by one of the following configuration parameters:

    net.topology.script.file.name parameter in the configuration file

    net.topology.node.switch.mapping.impl parameter in the configuration file
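
    A topology script receives one or more IP addresses or host names as arguments and prints one rack path per argument. The Python sketch below shows a minimal script of that shape. The host-to-rack table is entirely hypothetical, and a real deployment would load it from a file or an inventory system.

```python
import sys

# Hypothetical mapping of DataNode addresses to rack paths.
RACK_BY_HOST = {
    "10.0.1.11": "/d1/r1",
    "10.0.1.12": "/d1/r1",
    "10.0.2.21": "/d1/r2",
}
DEFAULT_RACK = "/default-rack"   # fallback rack name mentioned in the text


def resolve(hosts):
    """Return one rack path per host, in the same order as the input."""
    return [RACK_BY_HOST.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # Print the rack paths separated by spaces, one per argument.
    print(" ".join(resolve(sys.argv[1:])))
```

    Saved as an executable file and referenced through net.topology.script.file.name, a script like this lets the NN place each DN in the topology without a custom Java class.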

    Network Topology

    distance(/d1/r1/n1, /d1/r1/n1) = 0 (same node)

    distance(/d1/r1/n1, /d1/r1/n2) = 2 (different nodes on the same rack)

    distance(/d1/r1/n1, /d1/r2/n3) = 4 (nodes on different racks in the same data center)

    distance(/d1/r1/n1, /d2/r3/n4) = 6 (nodes in different data centers)

    [Figure: network topology]
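
    The distances listed above follow from counting the hops from each node up to their closest common ancestor in the topology tree. The Python sketch below implements that rule for path strings such as /d1/r1/n1. The function name is an assumption, and the code is a model of the rule rather than Hadoop's own implementation.

```python
def distance(a, b):
    """Hops from a up to the closest common ancestor, plus hops down to b."""
    path_a = a.strip("/").split("/")
    path_b = b.strip("/").split("/")
    common = 0
    for x, y in zip(path_a, path_b):
        if x != y:
            break
        common += 1
    return (len(path_a) - common) + (len(path_b) - common)


if __name__ == "__main__":
    print(distance("/d1/r1/n1", "/d1/r1/n1"))  # 0, same node
    print(distance("/d1/r1/n1", "/d1/r1/n2"))  # 2, same rack
    print(distance("/d1/r1/n1", "/d1/r2/n3"))  # 4, same data center
    print(distance("/d1/r1/n1", "/d2/r3/n4"))  # 6, different data centers
```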

    Data Block Replicas

    HDFS block placement will use rack awareness for fault tolerance by placing one block replica on a different rack.

    HDFS stores each file as a sequence of blocks. All blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance.

    The block size and replication factor are configurable per file. The default block size in HDFS is 128 MB in Hadoop 2.x and 64 MB in Hadoop 1.x. The default replication factor is 3. The maximum number of replicas is the total number of DNs at that time, because the NN does not allow one DN to hold multiple replicas of the same block.
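
    As a worked example of these defaults, the Python sketch below splits a file size into block sizes and computes the raw storage consumed after replication. The function names and the 300 MB sample file are hypothetical.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=128 * MB):
    """Return the list of block sizes; only the last block may be smaller."""
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes


def raw_storage(file_size, replication=3):
    """Bytes stored across the cluster once every block is replicated."""
    return file_size * replication


if __name__ == "__main__":
    blocks = split_into_blocks(300 * MB)
    print([size // MB for size in blocks])   # block sizes in MB
    print(raw_storage(300 * MB) // MB)       # total MB on disk
```

    A block occupies only as much disk space as the data it holds, so the final 44 MB block of a 300 MB file does not consume a full 128 MB.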

    Hadoop divides a file into blocks based on bytes, without taking into account the logical records within the file. That means the start of an HDFS block typically contains a remainder of the
