Getting An Overview of Big Data (Module 1)
Roughly 2.5 quintillion bytes of data are generated every day. Predictions by
Statista suggested that by the end of 2021, about 74 zettabytes (74 trillion GB) of data
would be generated on the internet. Managing such a vast and ever-growing stream
of data is increasingly difficult. Big Data was introduced to manage this huge, complex
data: it refers to datasets so large and complex that they cannot be stored, processed,
or analyzed with traditional methods, and to the techniques used to extract meaningful
information from them.
• Structured Data: Structured data refers to well-organized data that fits neatly
into relational databases or tabular formats. It is typically organized in rows and
columns, with a predefined schema. Examples include data from databases,
spreadsheets, and transaction logs.
• Semi-structured Data: Semi-structured data falls somewhere between
structured and unstructured data. It has some organizational properties
but does not conform to a strict schema. Examples include XML files,
JSON (JavaScript Object Notation) documents, and log files (see the short sketch after this list).
• Unstructured Data: Unstructured data lacks a predefined schema and
does not fit neatly into traditional databases. It includes text
documents, emails, social media posts, multimedia files (such as
images and videos), sensor data, and web logs. Analyzing unstructured
data often requires advanced techniques such as natural language
processing (NLP) and image recognition.
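To make the contrast concrete, here is a minimal Python sketch that reads the same illustrative customer record as structured CSV and as semi-structured JSON; the field names are hypothetical.

import csv, json, io

# Structured: fixed columns and a predefined schema (illustrative fields)
csv_text = "id,name,amount\n1,Alice,19.99\n"
row = next(csv.DictReader(io.StringIO(csv_text)))
print(row["name"], row["amount"])                 # every row has the same columns

# Semi-structured: self-describing keys, nested and optional fields allowed
json_text = '{"id": 1, "name": "Alice", "tags": ["new"], "address": {"city": "Cairo"}}'
doc = json.loads(json_text)
print(doc["name"], doc.get("address", {}).get("city"))   # no fixed schema is required

# Unstructured data (free text, images, video) has no such keys at all and usually
# needs NLP or image-recognition techniques before it can be analyzed.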
Elements of Big data
• The elements of Big Data are often described by multiple 'V's. Here are the most commonly cited ones:
1. Volume: The name 'Big Data' itself points to an enormous size. Volume refers to the massive
amount of data generated every second from various sources such as business processes,
machines, social media platforms, networks, human interactions, and more.
2. Velocity: Velocity refers to the speed at which data is created, stored, analyzed, and visualized. In the
context of Big Data, the speed at which the data is generated and processed is incredibly high.
3. Variety: Variety refers to the different types of data we can now use. Data can be structured,
semi-structured, or unstructured, and can be gathered from various sources.
4. Veracity: Veracity refers to the quality of the data, which can vary greatly. Data veracity reflects the
truthfulness of a data set and your level of confidence in it.
5. Value: Value refers to the ability to turn data into value, that is, to turn data into
information and information into insights. This is critical for businesses as they seek to gain a
competitive edge.
What is Big Data Analytics
HDFS (Hadoop Distributed File System)
Data is written to the DataNodes whenever a user makes a change, and new data is appended to the end of the file. Blocks
are replicated across several DataNodes to ensure data consistency and fault tolerance. If a node fails, the system automatically
recovers the data from a replica and re-replicates it across the remaining healthy nodes. DataNodes store blocks as ordinary files
on their local disks, while HDFS presents them to clients as a single logical file system. This architecture allows HDFS to scale
horizontally as the number of users and the volume of data increase. The block size is fixed (128 MB by default); when a file is
larger than the block size, the remaining data is placed in the next block. For example, if the data is 135 MB and the block size
is 128 MB, two blocks will be created: the first block will be 128 MB, and the second block will hold the remaining 7 MB. Files
are therefore stored as a sequence of blocks distributed across the cluster.
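As a quick check on the block arithmetic in the example above, the small Python sketch below splits a file size into HDFS-style blocks; the 128 MB block size mirrors the default used in the example.

import math

BLOCK_SIZE_MB = 128                      # default HDFS block size from the example
file_size_mb = 135

num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
last_block_mb = file_size_mb - (num_blocks - 1) * BLOCK_SIZE_MB
blocks = [BLOCK_SIZE_MB] * (num_blocks - 1) + [last_block_mb]
print(num_blocks, "blocks of sizes (MB):", blocks)    # 2 blocks of sizes (MB): [128, 7]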
HDFS Commands
• Here are some of the commonly used Hadoop Distributed File System
(HDFS) commands:
1. ls: Lists the files in a directory. Use -ls -R for a recursive
listing, which is useful when we want the full hierarchy of a folder:
bin/hdfs dfs -ls <path>
2. mkdir: Creates a directory. In Hadoop DFS there is no home
directory by default, so let's first create one:
bin/hdfs dfs -mkdir <folder name>
3. touchz: Creates an empty file:
bin/hdfs dfs -touchz <file_path>
• To start all of the Hadoop daemons, use the following command:
• sbin/start-all.sh
• To check whether the Hadoop services are up and running, use the following
command:
• jps
• These commands are executed in the terminal or command prompt
of the system where Hadoop is installed (a short scripting sketch follows).
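The same commands can be scripted. Below is a small, hypothetical Python sketch that shells out to the hdfs client; it assumes Hadoop is installed, its daemons are running, hdfs is on the PATH, and the /user/demo path is only an example.

import subprocess

# List the HDFS root directory (equivalent to: bin/hdfs dfs -ls /)
subprocess.run(["hdfs", "dfs", "-ls", "/"], check=True)

# Create a directory and an empty file inside it (mkdir and touchz from above)
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/user/demo"], check=True)
subprocess.run(["hdfs", "dfs", "-touchz", "/user/demo/empty.txt"], check=True)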
MapReduce
• MapReduce is a programming model and an associated implementation for processing and
generating big data sets with a parallel, distributed algorithm on a cluster. It is a core component of
the Hadoop framework.
• A MapReduce program is composed of a Map procedure, which performs filtering and sorting, and
a Reduce method, which performs a summary operation. The "MapReduce System" orchestrates
the processing by marshalling the distributed servers, running the various tasks in parallel,
managing all communications and data transfers between the various parts of the system, and
providing for redundancy and fault tolerance.
Here's a brief overview of how MapReduce works:
1. Map Phase: The Map function takes an input pair and produces a set of intermediate key/value
pairs. The Map function is applied to every input data element.
2. Shuffle and Sort Phase: The MapReduce framework groups intermediate values based on their
intermediate keys using a process called Shuffle and Sort.
3. Reduce Phase: In the Reduce phase, the reduce function is applied for each unique key in the
sorted order. The Reduce function takes an intermediate key and a set of values for that key and
merges these values to form a possibly smaller set of values (the sketch below walks through these three phases on a tiny example).
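Here is a small, self-contained Python sketch that simulates the three phases in memory with the classic word-count example; a real Hadoop job would distribute exactly this logic across the cluster.

from itertools import groupby
from operator import itemgetter

documents = ["big data is big", "data has value"]

# 1. Map phase: emit an intermediate (word, 1) pair for every word in the input
intermediate = [(word, 1) for doc in documents for word in doc.split()]

# 2. Shuffle and Sort phase: group the intermediate pairs by their key
intermediate.sort(key=itemgetter(0))
grouped = {key: [v for _, v in pairs]
           for key, pairs in groupby(intermediate, key=itemgetter(0))}

# 3. Reduce phase: merge the values of each unique key into a smaller result
counts = {word: sum(values) for word, values in grouped.items()}
print(counts)    # {'big': 2, 'data': 2, 'has': 1, 'is': 1, 'value': 1}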
• The key contributions of the MapReduce framework are not the actual map and reduce functions,
but the scalability and fault-tolerance achieved for a variety of applications due to parallelization
• The use of this model is beneficial only when the optimized distributed shuffle operation (which
reduces network communication cost) and fault tolerance features of the MapReduce framework
come into play
• MapReduce libraries have been written in many programming languages, with different levels of
optimization. A popular open-source implementation that has support for distributed shuffles is
part of Apache Hadoop
Hive
• Hive is a data warehouse infrastructure tool that sits on top of Hadoop to process structured
data. It was developed by Facebook and is now used by other companies such as Amazon and Netflix.
Hive provides an SQL-like interface between the user and the Hadoop Distributed File System (HDFS),
making it easier to query and analyze large datasets stored there.
Here are some key features of Hive:
1. Hive Query Language (HiveQL): Hive uses a language called HiveQL, which is similar to SQL. This
allows users to express data queries, transformations, and analyses in a familiar syntax.
2. Data Warehousing: Hive is frequently used for data warehousing tasks like data encapsulation,
ad-hoc queries, and analysis of large datasets.
3. Compatibility with Hadoop: Hive is built on top of Hadoop and integrates well with the Hadoop
ecosystem. It can work with data stored in HDFS or other compatible storage systems.
4. Scalability and Performance: Hive is designed to enhance scalability, extensibility, performance,
and fault tolerance.
5. Support for Various Data Formats: Hive supports data stored in a variety of formats, including Text
Files, Sequence Files, RCFiles, Avro Data Files, ORC Files, and Parquet Files.
6. Components of Hive: Hive includes components like HCatalog, a table and storage management
layer for Hadoop, and WebHCat, a service that provides an HTTP interface for Hadoop MapReduce,
Pig, Hive tasks, or Hive metadata operations.
Please note that Hive is not built for Online Transactional Processing (OLTP) workloads. It is more
suited for batch processing than for interactive use. The emphasis is on high throughput of data
access rather than low latency of data access.
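To illustrate the SQL-like flavour of HiveQL, here is a hedged Python sketch that builds a small HiveQL script and passes it to the hive command line with its -e option; the table, columns, and data layout are hypothetical, and a working Hive installation on the same machine is assumed.

import subprocess

# Hypothetical HiveQL: define a delimited table over HDFS data, then run a batch query
hql = """
CREATE TABLE IF NOT EXISTS page_views (user_id STRING, url STRING, view_time STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
SELECT url, COUNT(*) AS views FROM page_views GROUP BY url;
"""

# 'hive -e' executes the quoted HiveQL string; Hive compiles it into jobs on Hadoop
subprocess.run(["hive", "-e", hql], check=True)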
Pig and Pig Latin
• Apache Pig is a high-level platform used to process large datasets. It provides a high level of
abstraction for processing over MapReduce. The main components of Apache Pig are the Pig Latin
scripting language and the Pig Engine.
• Pig Latin is a data flow language used by Apache Pig to analyze data in Hadoop. It abstracts the
programming from the Java MapReduce idiom into a higher-level notation. Pig Latin statements are
used to process the data. Each statement must end with a semicolon. Statements may include
expressions and schemas. By default, they are processed using multi-query execution.
Here are some key features of Apache Pig:
1. Ease of programming: Pig Latin is easy to learn for programmers who are familiar with scripting
languages and SQL.
2. Optimization opportunities: The system automatically optimizes the execution of Pig Latin scripts,
allowing the programmer to focus on semantics rather than efficiency.
3. Extensibility: Users can create their own functions to do special-purpose processing.
4. Handling of various data types: Pig Latin can handle various data types including tuples, bags, and
maps, which are not natively supported in MapReduce.
5. Multi-query approach: Apache Pig reduces the length of code by using a multi-query approach,
thereby reducing development time.
• Please note that while Pig Latin is used in the context of Apache Pig
and Hadoop, it is unrelated to the playful language game also known
as Pig Latin
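To show what a Pig Latin data-flow script looks like, the hedged Python sketch below writes the classic word-count script to a file and runs it in Pig's local mode; the input path and file names are illustrative, and a local Pig installation is assumed.

import subprocess

# A minimal Pig Latin data flow: load lines, split into words, group them, count them
script = """
lines  = LOAD 'input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group, COUNT(words);
DUMP counts;
"""

with open("wordcount.pig", "w") as f:
    f.write(script)

# 'pig -x local' runs the script against the local file system instead of a cluster
subprocess.run(["pig", "-x", "local", "wordcount.pig"], check=True)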
Sqoop
• Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and
structured datastores such as relational databases. It is a key component in the Hadoop ecosystem
for moving data from non-Hadoop data stores, such as relational databases and data warehouses,
into Hadoop.
• Here are some key features of Sqoop:
1. Efficient Data Transfer: Sqoop uses a connector-based architecture which allows it to transfer data
between any relational database management system and Hadoop quickly and efficiently.
2. Import/Export Operations: Sqoop can import data from a relational database into HDFS, Hive, or
HBase for further processing. It can also export the results back into the relational database.
3. Connectors for All Major RDBMS Databases: Sqoop includes connectors for multiple major
RDBMS databases, including MySQL, PostgreSQL, Oracle, SQL Server, and DB2.
4. Parallel Import/Export: Sqoop uses MapReduce to import and export the data, which provides
parallel operation as well as fault tolerance.
5. Incremental Load: Sqoop also supports incremental loads, which allows you to load only the new
or modified rows from the relational database into Hadoop.
6. Hive Integration: Sqoop can also import the data directly into Hive by generating and executing a
CREATE TABLE statement to define the data's layout in Hive.
• Sqoop successfully graduated from the Incubator in March of 2012 and is now a Top-Level Apache
project. The latest stable release is 1.4.7.
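Sqoop is driven from its command line. The hedged Python sketch below assembles a typical import command; the JDBC URL, credentials, table, and target directory are placeholders, and a configured Sqoop and Hadoop installation is assumed.

import subprocess

# Import one relational table into HDFS using 4 parallel map tasks
cmd = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost:3306/sales",    # placeholder JDBC URL
    "--username", "report", "--password", "secret",   # placeholder credentials
    "--table", "orders",                              # source table in the RDBMS
    "--target-dir", "/user/hadoop/orders",            # destination directory in HDFS
    "-m", "4",                                        # number of parallel mappers
]
subprocess.run(cmd, check=True)

# Adding "--hive-import" would load the same data directly into a Hive table instead.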
ZooKeeper
• Apache ZooKeeper is an open-source server that enables highly reliable distributed coordination. It is
a centralized service for maintaining configuration information, naming, providing distributed
synchronization, and providing group services. These services are used in some form or another by
distributed applications.
• ZooKeeper provides a way to ensure that nodes in a distributed system are aware of each other and
can coordinate their actions. It does this by maintaining a hierarchical tree of data nodes called
"Znodes", which can be used to store and retrieve data and maintain state information.
• ZooKeeper provides a set of primitives, such as locks, barriers, and queues, that can be used to
coordinate the actions of nodes in a distributed system. It also provides features such as leader
election, failover, and recovery, which can help ensure that the system is resilient to failures.
• ZooKeeper is widely used in distributed systems such as Hadoop, Kafka, and HBase, and it has
become an essential component of many distributed applications.
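As a small illustration of Znodes, here is a hedged Python sketch using the third-party kazoo client (an assumption; it is not part of ZooKeeper itself). The paths and values are hypothetical, and a ZooKeeper server on localhost:2181 is assumed.

from kazoo.client import KazooClient

# Connect to a (hypothetical) local ZooKeeper ensemble
zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Znodes form a hierarchical tree; store a small piece of shared configuration in one
zk.ensure_path("/app/config")
if not zk.exists("/app/config/db_url"):
    zk.create("/app/config/db_url", b"jdbc:mysql://dbhost/sales")

value, stat = zk.get("/app/config/db_url")   # read it back, with version metadata
print(value.decode(), stat.version)

zk.stop()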
Flume
• Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and
moving large amounts of log data. It has a simple and flexible architecture based on streaming data
flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery
mechanisms. It uses a simple extensible data model that allows for online analytic applications.
• Flume is designed to move the log data generated by application servers into HDFS at a higher speed.
It is used to import huge volumes of event data produced by social networking sites like Facebook and
Twitter, and e-commerce websites like Amazon and Flipkart. Flume supports a large set of source and
destination types.
• Flume provides a steady flow of data between data producers and the centralized stores, buffering data
when the rate of incoming data exceeds the rate at which it can be written to the destination. It also
provides the feature of contextual routing. The transactions in Flume are channel-based, where two
transactions (one sender and one receiver) are maintained for each message. It guarantees reliable
message delivery.
• Flume can be scaled horizontally. It is highly configurable and customizable. It is used in conjunction
with Hadoop to create applications, load data from various sources like Twitter, and stream it to HDFS.
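A Flume agent is wired together through a plain properties file that names its sources, channels, and sinks. The Python sketch below writes a minimal single-agent configuration (modelled on the standard netcat-to-logger example in the Flume documentation); the agent name and port are illustrative, and a Flume installation is assumed.

# Minimal Flume agent: one netcat source, one in-memory channel, one logger sink
conf = """
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

a1.channels.c1.type = memory
a1.sinks.k1.type = logger

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
"""

with open("example.conf", "w") as f:
    f.write(conf)

# The agent would then be started with:
#   flume-ng agent --conf-file example.conf --name a1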
Oozie
• Apache Oozie is a workflow scheduler system for managing Apache Hadoop jobs. Here are some
key points about Oozie:
• Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions.
• Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability.
• It supports several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig,
Hive, Sqoop, and DistCp) as well as system-specific jobs (such as Java programs and shell scripts).
• It consists of two parts: a Workflow engine and a Coordinator engine.
• The Workflow engine is responsible for storing and running workflows composed of Hadoop jobs, e.g.,
MapReduce, Pig, and Hive.
• The Coordinator engine runs workflow jobs based on predefined schedules and the availability of data.
• Oozie is scalable, reliable, extensible, and flexible. It can manage the timely execution of thousands of
workflows (each consisting of dozens of jobs) in a Hadoop cluster.
• It makes it very easy to rerun failed workflows.
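To give a feel for what an Oozie Workflow (a DAG of actions) looks like, the Python sketch below writes a minimal, hypothetical workflow.xml containing a single shell action; the schema versions and properties vary by Oozie release, so treat this as an outline rather than a ready-to-run definition.

# Hypothetical minimal Oozie workflow: start -> one shell action -> end (or kill on error)
workflow_xml = """
<workflow-app xmlns="uri:oozie:workflow:0.5" name="demo-wf">
  <start to="hello"/>
  <action name="hello">
    <shell xmlns="uri:oozie:shell-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <exec>echo</exec>
      <argument>workflow ran</argument>
    </shell>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail"><message>Shell action failed</message></kill>
  <end name="end"/>
</workflow-app>
"""

with open("workflow.xml", "w") as f:
    f.write(workflow_xml)

# The workflow would be submitted with the Oozie CLI, for example:
#   oozie job -config job.properties -run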