Getting an Overview of Big Data (Module 1)

The document provides an overview of big data, including what it is, its evolution and management, types of data, elements, structuring data, and advantages of big data analytics. Big data refers to large and complex datasets that traditional data management tools cannot handle. It has evolved from traditional databases and data warehouses to modern big data technologies and platforms.


Getting an Overview of Big Data:

What is Big Data


• Big Data in business analytics refers to the process of collecting,
examining, and analyzing large amounts of data to discover market
trends, insights, and patterns that can help companies make better
business decisions. This information is available quickly and efficiently
so that companies can be agile in crafting plans to maintain their
competitive advantage.
• Speed, scale, and reliability are the major considerations for businesses looking to implement a big
data analytics system: they need to be able to process large amounts of data from
different sources at high speed, and then have confidence in the reliability of
the end result.
• Big data analytics can be applied to both structured and unstructured data.
Structured data is highly organized and quantitative, making it the easiest to
digest and use. Unstructured data includes photos, videos, audio files, text, etc.
While it is more difficult to scrape information from unstructured data, it can be
more enriching than structured data.
• Big data analytics and business analytics share a few similarities but are distinct
categories of software. Big data analytics can also be used to enrich business
analytics. It has forced an evolution in the business analytics world,
demonstrating how the world of business intelligence has changed.
Evolution of Data Management
• The evolution of data management in Big Data Analytics has been a transformative journey. Here's a
brief overview:
• Traditional Databases: The journey began with traditional databases designed for structured data.
These databases were efficient for managing small volumes of structured data but struggled with large
volumes and varieties of data
• Data Warehouses: To overcome the limitations of traditional databases, data warehouses were
introduced. They allowed for the storage and analysis of larger volumes of structured data. However,
they still struggled with unstructured and semi-structured data
• Big Data Technologies: With the advent of the internet and digital technologies, the volume, velocity,
and variety of data increased exponentially. This led to the development of big data technologies like
Hadoop and NoSQL databases. These technologies were designed to process vast volumes of
structured, semi-structured, and unstructured data
• Big Data Analytics Platforms: The latest evolution in data management is the development of big
data analytics platforms. These platforms not only store and process big data but also provide
advanced analytics capabilities. They use technologies like machine learning and artificial intelligence
to derive insights from big data
• Future Trends: The future of data management in big data analytics looks promising with the advent
of technologies like AI and machine learning, which are expected to further revolutionize the field. This
evolution has enabled organizations to leverage their data in unprecedented ways, leading to
improved decision-making, enhanced operational efficiency, and the creation of new products and services.
Structuring Big Data
• Structuring Big Data in business analytics involves organizing and formatting the data so that
it can be easily stored, processed, and analyzed. Here's a brief overview:
• Structured Data: This is data that is organized in a predefined manner and is easy to store and
query. Examples include data stored in relational databases and spreadsheets. Structured data is
typically organized in rows and columns, with each column representing a specific attribute and
each row representing a record
• Unstructured Data: This is data that does not have a predefined format or organization. Examples
include text documents, social media posts, images, videos, and audio files. Unstructured data is
more challenging to analyze than structured data due to its lack of organization
• Semi-Structured Data: This is a type of data that does not conform to the formal structure of data
models but contains tags or other markers to separate semantic elements. Examples include XML
and JSON files
• The process of structuring Big Data involves several steps (a small code sketch follows this list):
• Data Collection: This involves gathering data from various sources, which could be structured, semi-
structured, or unstructured.
• Data Cleaning: This step involves removing errors, inconsistencies, and redundancies in the data.
• Data Transformation: This involves converting the data into a format that can be easily analyzed. For
unstructured and semi-structured data, this could involve extracting relevant information and converting it
into a structured format.
• Data Storage: The structured data is then stored in a database or data warehouse where it can be easily
accessed for analysis.
• Data Analysis: Finally, the structured data is analyzed using various data analysis tools and techniques to
derive insights.
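• The following is a minimal sketch of that pipeline in Python, assuming the pandas library is available; the records, field names, and file name are hypothetical and stand in for real data sources and a real warehouse.

# Illustrative structuring pipeline (assumes pandas is installed; names/paths are placeholders).
import pandas as pd

# Data collection: semi-structured JSON-like records, e.g. exported from a web service.
raw_records = [
    {"customer": "A101", "amount": "250.5", "city": "Pune"},
    {"customer": "A102", "amount": None, "city": "pune"},
    {"customer": "A101", "amount": "250.5", "city": "Pune"},  # duplicate record
]
df = pd.DataFrame(raw_records)

# Data cleaning: remove duplicates and rows with missing values.
df = df.drop_duplicates().dropna()

# Data transformation: convert types and normalise text so the data becomes structured.
df["amount"] = df["amount"].astype(float)
df["city"] = df["city"].str.title()

# Data storage: persist the structured table (a CSV file stands in for a database/warehouse).
df.to_csv("structured_sales.csv", index=False)

# Data analysis: a simple aggregation to derive an insight.
print(df.groupby("city")["amount"].sum())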
Types of Data

Roughly 2.5 quintillion bytes of data are generated every day by users. Predictions by
Statista suggested that by the end of 2021, 74 zettabytes (74 trillion GB) of data
would be generated on the internet. Managing such a vast and continuous stream of
data is increasingly difficult. Big Data was introduced to manage such huge, complex
data: it is concerned with extracting meaningful information from large and complex
datasets that cannot be processed or analyzed by traditional methods.
• Structured Data: Structured data refers to well-organized data that fits neatly
into relational databases or tabular formats. It is typically organized in rows and
columns, with a predefined schema. Examples include data from databases,
spreadsheets, and transaction logs.
• Semi-structured Data: Semi-structured data falls somewhere between
structured and unstructured data. It has some organizational properties
but does not conform to a strict schema. Examples include XML files,
JSON (JavaScript Object Notation) documents, and log files.
• Unstructured Data: Unstructured data lacks a predefined schema and
does not fit neatly into traditional databases. It includes text
documents, emails, social media posts, multimedia files (such as
images and videos), sensor data, and web logs. Analyzing unstructured
data often requires advanced techniques such as natural language
processing (NLP) and image recognition.
Elements of Big Data
• The elements of Big Data are often described by multiple 'V's. Here are the most commonly referred
ones:
1. Volume: The name 'Big Data' itself is related to a size which is enormous. Volume refers to the
massive amount of data generated every second from various sources like business processes,
machines, social media platforms, networks, human interactions, and more.
2. Velocity: Velocity refers to the speed at which data is created, stored, analyzed, and visualized. In the
context of Big Data, the speed at which the data is generated and processed is incredibly high.
3. Variety: Variety refers to the different types of data we can now use. Data can be structured, semi-
structured, or unstructured, and can be gathered from various sources .
4. Veracity: Veracity refers to the quality of the data, which can vary greatly. Data veracity reflects the
truthfulness of a data set and your level of confidence in it.
5. Value: Value refers to the ability to turn data into information and information into insights. This is
critical for businesses as they seek to gain a competitive edge.
What is Big Data Analytics

• Big data analytics is important because it helps companies leverage
their data to identify opportunities for improvement and
optimization. Across different business segments, increasing
efficiency leads to overall more intelligent operations, higher profits,
and satisfied customers. Big data analytics helps companies reduce
costs and develop better, customer-centric products and services.
• Data analytics helps provide insights that improve the way our
society functions. In health care, big data analytics not only keeps
track of and analyses individual records but it plays a critical role in
measuring outcomes on a global scale. During the COVID-19
pandemic, big data informed health ministries within each nation's
government on how to proceed with vaccinations and helped devise
solutions for mitigating future outbreaks.
Advantages of Big Data Analytics
• Big Data Analytics offers numerous advantages to businesses and organizations. Here are some key
benefits:
• Cost Optimization: Big Data technologies reduce the cost of storing, processing, and analyzing
large volumes of data for enterprises. They can also help find cost-effective and efficient business
practices
• Improved Efficiency: Big Data techniques can dramatically enhance operational efficiency. They can
collect vast volumes of usable customer data, providing valuable insights that can streamline
operations
• Understanding Market Conditions: By examining Big Data, businesses can gain a better
understanding of current market conditions. For example, a company can determine the most
popular products by studying a customer's purchase behavior. This aids in the analysis of trends and
customer desires
• Product Development: Big Data Analytics can provide a better understanding of customer needs,
leading to the development of products and services that are more aligned with what customers
want
• New Revenue Opportunities: Big Data Analytics can help organizations identify new opportunities for revenue
generation. This could be through the identification of new markets, the development of new products, or the
improvement of existing products
• Improved Decision Making: Big Data Analytics enables organizations to make data-driven decisions. This leads
to smarter business moves, more efficient operations, higher profits, and happier customers
• Personalization and Customer Service: Big Data Analytics can help improve customer service and create a
personalized experience for customers. By understanding customer behavior and preferences, companies can
tailor their products and services to meet the specific needs of their customers
• These are just a few of the many advantages that Big Data Analytics can offer. The specific benefits can vary
depending on the industry and the specific use case.
Careers and Future of Big Data
• Big Data Analytics offers a wide range of career opportunities. Here are some of
the top careers in this field:
1. Data Scientist: They work to deeply understand and analyze data to provide actionable insights.
2. Data Analyst: They collect, process, and perform statistical analyses of data.
3. Data Engineer: They prepare the "big data" infrastructure to be analyzed by Data Scientists.
4. Data Architect: They create the blueprints for data management systems.
5. Business Analyst: They bridge the gap between IT and the business, using data analytics to assess processes,
determine requirements, and deliver data-driven recommendations.
6. Data Administrator: They ensure that databases are available to all relevant users, are performing properly, and
are being kept safe.
7. Business Intelligence (BI) Developer: They design and develop strategies to assist business users in quickly
finding the information they need to make better business decisions.
• The future of Big Data Analytics looks promising with the advent of technologies like
AI and machine learning, which are expected to further revolutionize the field. Here
are some key trends for the future of Big Data Analytics:
1. Increasing Velocity of Big Data Analytics: The future of big data analytics will increasingly focus on data
freshness, with the ultimate goal of real-time analysis, enabling better-informed decisions and increased
competitiveness.
2. Artificial Intelligence and Machine Learning: AI and ML are being increasingly used in Big Data Analytics for
predictive modeling, anomaly detection, and automation.
3. Data Privacy and Security: As the volume of data grows, so does the need for robust data privacy and security
measures.
4. Increased Cloud Adoption: More and more businesses are moving their data to the cloud for cost-effective,
scalable, and secure data storage and analytics.
5. Natural Language Processing: NLP is being used to analyze unstructured data, such as text and voice data, to
derive insights.
6. Predictive Analytics: The use of predictive analytics is expected to grow, allowing businesses to forecast future
trends and make proactive decisions.
• These trends indicate that Big Data Analytics will continue to play a crucial role in decision-making processes
across various industries
Introducing Technologies for Handling Big Data:
• There are several technologies that are commonly used for handling Big Data in Big Data
Analytics
1. Apache Hadoop: An open-source framework that allows for the distributed processing of large data sets across clusters of computers
using simple programming models.
2. Apache Spark: An open-source, distributed computing system used for big data processing and analytics.
3. MongoDB: A source-available, cross-platform, document-oriented database program. Classified as a NoSQL database program, MongoDB
uses JSON-like documents with optional schemas.
4. Cassandra: A highly scalable, high-performance distributed database designed to handle large amounts of data across many commodity
servers, providing high availability with no single point of failure.
5. Presto: An open-source distributed SQL query engine for running interactive analytic queries against data sources of all sizes, ranging
from gigabytes to petabytes.
6. RapidMiner: A data science platform that provides an integrated environment for data preparation, machine learning, deep learning, text
mining, and predictive analytics.
7. ElasticSearch: A search engine based on the Lucene library. It provides a distributed, multitenant-capable full-text search engine with an
HTTP web interface and schema-free JSON documents.
8. Kafka: A distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics,
data integration, and mission-critical applications.
9. Splunk: A software platform widely used for monitoring, searching, analyzing, and visualizing machine-generated data in real time.
10. KNIME: A free and open-source data analytics, reporting, and integration platform. KNIME integrates various components for machine
learning and data mining.
11. Tableau: A powerful data visualization tool used in the Business Intelligence industry. It helps
simplify raw data into an easily understandable format.
• These technologies are used in various stages of Big Data Analytics, including data storage, data
mining, data analytics, and data visualization
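• As a small illustration of one of these technologies, the following is a minimal PySpark word-count sketch. It assumes Spark and the pyspark package are installed and that a text file named input.txt exists; the file name is only a placeholder.

# Minimal PySpark word count (assumes pyspark is installed; "input.txt" is a placeholder path).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCountSketch").getOrCreate()

lines = spark.read.text("input.txt").rdd.map(lambda row: row[0])
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

for word, count in counts.take(10):   # show a small sample of the results
    print(word, count)

spark.stop()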
Distributed and Parallel Computing in Big Data
• Distributed and parallel computing are two computational methods
that are often used in the context of Big Data
Parallel Computing
• In parallel computing, multiple processors perform multiple tasks simultaneously.
• All processors may have access to a shared memory to exchange information between processors.
• Parallel computing provides concurrency and saves time and money
• Memory in parallel systems can either be shared or distributed
Distributed Computing
• In distributed computing, a single task is divided among different computers.
• Each processor has its own private memory (distributed memory)
• Information is exchanged by passing messages between the processors
• Distributed computing improves system scalability, fault tolerance, and resource sharing capabilities
• In the context of Big Data, both parallel and distributed computing play crucial roles
• Handling Large Volumes of Data: Big Data often involves handling a huge volume of data that cannot be
processed effectively with a single machine. Distributed systems divide and conquer the tasks, allowing for the
acquisition and analysis of intelligence from Big Data
• Speeding Up Processing: Parallel computing is essential in Big Data scenarios to perform large-scale data analysis
and processing tasks concurrently, thereby significantly reducing the time required.
• Scalability: As data volumes continue to grow, distributed computing allows systems to easily scale up by adding
more machines to the network
• Fault Tolerance: Distributed systems are designed to continue operating even if some machines or components
fail, which is crucial for Big Data applications where downtime can be costly
• Resource Sharing: Distributed computing allows different users and applications to access resources such as data
and computational power, which is important in a Big Data context where resources are often pooled
• It’s important to note that the choice between parallel and distributed computing depends on the specific
requirements of the task, including the volume of data, the computational power of the machines available, and
the nature of the problem to be solved. In many real-world applications, a combination of both methods is often
employed
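• A tiny Python sketch of the parallel-computing idea using the standard multiprocessing module: the same task (squaring numbers) is divided among several worker processes instead of being done one item at a time. This only illustrates parallelism on a single machine, not a distributed system.

# Parallel computing sketch: several processes work on chunks of the data at once.
from multiprocessing import Pool

def square(n):
    return n * n

if __name__ == "__main__":
    data = range(1_000_000)
    with Pool(processes=4) as pool:          # 4 worker processes run in parallel
        results = pool.map(square, data)     # the work is divided among the workers
    print(sum(results))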
Introducing Hadoop
• Hadoop is an open-source software framework that is used for storing and processing large
amounts of data in a distributed computing environment. It is designed to handle big data and is
based on the MapReduce programming model, which allows for the parallel processing of large
datasets.
• Hadoop's framework is based on Java programming with some native code in C and shell scripts.
It has two main components:
1. HDFS (Hadoop Distributed File System): This is the storage component of Hadoop, which allows
for the storage of large amounts of data across multiple machines. It is designed to work with
commodity hardware, which makes it cost-effective.
2. YARN (Yet Another Resource Negotiator): This is the resource management component of
Hadoop, which manages the allocation of resources (such as CPU and memory) for processing the
data stored in HDFS.
• Hadoop also includes several additional modules that provide extra functionality, such as Hive (a SQL-like
query language), Pig (a high-level platform for creating MapReduce programs), and HBase (a non-relational,
distributed database).
• Hadoop is commonly used in big data scenarios such as data warehousing, business intelligence, and machine
learning. It's also used for data processing, data analysis, and data mining. It enables the distributed processing
of large data sets across clusters of computers using a simple programming model.
• The history of Hadoop is quite interesting. Hadoop is developed under the Apache Software Foundation, and its
co-founders are Doug Cutting and Mike Cafarella. Doug Cutting named it after his son's toy
elephant. In October 2003, Google published its paper on the Google File System. In January 2006, MapReduce
development started on Apache Nutch, consisting of around 6,000 lines of code for MapReduce and around 5,000
lines of code for HDFS. In April 2006, Hadoop 0.1.0 was released.
• Hadoop was created under the Apache Software Foundation in 2006, based on a white paper written by Google in 2003
that described the Google File System (GFS) and the MapReduce programming model.
HDFS and MapReduce
Cloud Computing and Big Data
Features of Cloud Computing
Understanding Hadoop Ecosystem:
• Overview: Apache Hadoop is an open-source framework intended to
make interaction with big data easier. For those who are not
acquainted with this technology, a natural question arises: what is big
data? Big data is a term given to data sets that cannot be
processed efficiently with traditional methodologies such as an RDBMS.
Hadoop has made its place in industries and companies that need to
work on large data sets that are sensitive and need efficient handling.
Hadoop is a framework that enables the processing of large data sets
that reside in the form of clusters. Being a framework, Hadoop is made
up of several modules that are supported by a large ecosystem of
technologies.
Hadoop Ecosystem
• Introduction: Hadoop Ecosystem is a platform or a suite which
provides various services to solve the big data problems. It includes
Apache projects and various commercial tools and solutions. There
are four major elements of Hadoop i.e. HDFS, MapReduce, YARN,
and Hadoop Common Utilities. Most of the tools or solutions are
used to supplement or support these major elements. All these tools
work collectively to provide services such as absorption, analysis,
storage and maintenance of data etc.
• Following are the components that collectively form a Hadoop
ecosystem:
• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: Programming based Data Processing
• Spark: In-Memory data processing
• PIG, HIVE: Query based processing of data services
• HBase: NoSQL Database
• Mahout, Spark MLLib: Machine Learning algorithm libraries
• Solr, Lucene: Searching and Indexing
• Zookeeper: Managing cluster
• Oozie: Job Scheduling
Hadoop Distributed File System
• The Hadoop Distributed File System (HDFS) is a distributed file system
designed to run on commodity hardware. It has become a key tool for
managing pools of big data and supporting big data analytics
applications
• Here are some key features of HDFS:
1. Distributed Storage: HDFS stores each file as a sequence of blocks; all blocks in a file except the last
block are the same size. The files in HDFS are broken down into block-sized chunks, which are stored as
independent units.
2. Fault Tolerance: HDFS is designed to carry on working without a noticeable interruption to the user in
case of a failure. Each block is replicated multiple times; the default replication factor is 3. In case of a failure,
for example when a machine crashes, data is not lost because it is replicated and stored on other
machines.
3. High Throughput: HDFS is designed to support applications with large data sets, including individual
files that reach into the terabytes
4. Scalability: HDFS can support single clusters of thousands of nodes, each of which store part of the file
system's data
• The main components of HDFS are:
• NameNode: This is the master server that manages the file system namespace and regulates access to files by
clients
• DataNodes: These are the workhorses of the file system. They store and retrieve blocks when they are told to (by
clients or the NameNode), and they report back to the NameNode periodically with lists of blocks that they are
storing
• Your data is stored in HDFS in a distributed manner, which means it is split into chunks, or blocks, and
distributed across multiple nodes in a cluster
• This allows for efficient processing and analysis of data, particularly for large-scale, parallelizable tasks
HDFS Architecture
HDFS follows a master-slave architecture,
which includes the following elements:
• All the blocks on DataNodes are managed by the NameNode, which is known as the master node. It performs
the following functions:
1. Monitors and controls all the DataNode instances.
2. Permits users to access files.
3. Stores the metadata (block records) for all the blocks held on the DataNode instances.
4. Commits EditLogs to disk after every write operation to the file system namespace; together with the
FsImage, these logs allow the NameNode's metadata to be reconstructed after a restart or failure.
5. Tracks the health of the DataNodes through periodic heartbeats and block reports. Since every block is
replicated across several DataNodes, a failed DataNode can be replaced by another DataNode holding the
same blocks; the failure of a single DataNode therefore does not impact the rest of the cluster.
• There are two kinds of files in the NameNode: FsImage files and EditLogs
files:
1. FsImage: It contains the complete state of the file system namespace, including all
the directories and files, in a hierarchical format. It is called a file system image
because it is a point-in-time snapshot of the namespace.
2. EditLogs: The EditLogs file keeps track of the modifications that have
been made to the files of the filesystem since the last FsImage.
Secondary NameNode
• The Secondary NameNode periodically performs checkpoints of the file system metadata so that the
NameNode's EditLogs do not grow without bound. It performs the following duties:
1. At regular intervals it retrieves the current FsImage and the accumulated EditLogs from the NameNode,
so that all the transaction log data is gathered in one place where it can be replayed.
2. It merges the EditLogs into the FsImage to produce a new, consolidated FsImage and returns it to the
NameNode. This metadata describes where data is stored across the DataNodes of the cluster and what
type of data it is; it is also written to disk so that it survives a cluster restart, and keeping it consolidated
shortens NameNode restart time because fewer edits need to be replayed.
3. The checkpointed FsImage can be used to recover the file system namespace if the NameNode's own copy
is lost or corrupted, and it can serve as the starting point for backups of the metadata, for example when
data in one Hadoop cluster is backed up to another cluster or to a local file system.
Note that the Secondary NameNode is a checkpointing helper; it is not a hot standby for the NameNode.
DataNode
• Every slave machine that stores data runs a DataNode. DataNodes store their blocks on the local
file system (for example in ext3 or ext4 format). DataNodes do the following:
1. They store the actual data blocks.
2. They handle the requested operations on files, such as reading file
content and creating new data, as described above.
3. They carry out instructions from the NameNode, such as block creation,
deletion, and replication.
Checkpoint Node
• A Checkpoint Node creates checkpoints of the namespace at specified intervals. It fetches the
FsImage and EditLogs from the NameNode, merges them locally to produce a new image, and
delivers the new image back to the NameNode. Its directory structure is always identical to that
of the NameNode, so the checkpointed image is always available.
Backup Node
• The Backup Node provides an additional safeguard for the file system metadata. It maintains an
in-memory, up-to-date copy of the namespace that is kept in sync with the active NameNode by
receiving a stream of edits, and it also persists this namespace to disk. Because it already holds
the current namespace, it can create checkpoints without downloading the FsImage and EditLogs
from the NameNode. The Backup Node protects the metadata; recovery from the failure of
DataNodes is instead handled by the replicas of the data blocks, since the DataNodes store the
data and report their blocks for replication. The Backup Node itself is not used to provide high
availability of the data.
Blocks
• HDFS stores files as a sequence of blocks of a fixed default size (128 MB in recent Hadoop versions); this
default can be changed in the configuration depending on the performance required. Data written by users is
split into these block-sized chunks and appended block by block on the DataNodes. Blocks are replicated
across DataNodes to ensure data consistency and fault tolerance: if a node fails, the system automatically
recovers the data from the replicas and re-replicates it across the remaining healthy nodes. DataNodes keep
the blocks on their local file systems, and this block-based architecture allows HDFS to scale horizontally as
the number of users and the volume of data increase. When a file is larger than the block size, the data that
does not fit into one block is placed in the next block, and only the final block may be smaller than the block
size. For example, if a file is 135 MB and the block size is 128 MB, two blocks will be created: the first block
will be 128 MB and the second will hold the remaining 7 MB.
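• The block arithmetic above can be written out as a short calculation. This sketch simply computes how many blocks a file occupies and the size of the last block, assuming the 128 MB block size used in the example.

# How a 135 MB file is split with a 128 MB block size.
import math

block_size_mb = 128
file_size_mb = 135

num_blocks = math.ceil(file_size_mb / block_size_mb)
last_block_mb = file_size_mb - (num_blocks - 1) * block_size_mb

print(num_blocks)      # 2 blocks
print(last_block_mb)   # the second block holds the remaining 7 MB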
HDFS Commands
• Here are some of the commonly used Hadoop Distributed File System
(HDFS) commands
1. ls: Lists all the files. Use lsr for a recursive
listing; it is useful when we want the hierarchy of a folder.
bin/hdfs dfs -ls <path>
2. mkdir: Creates a directory. In Hadoop dfs there is no home
directory by default, so let's first create one.
bin/hdfs dfs -mkdir <folder name>
3. touchz: Creates an empty file.
bin/hdfs dfs -touchz <file_path>
4. copyFromLocal (or put): Copies files/folders from the local file system to
the HDFS store. This is one of the most important commands. The local filesystem means the
files present on the OS.
bin/hdfs dfs -copyFromLocal <local file path> <dest(present on hdfs)>
5. cat: Prints file contents.
bin/hdfs dfs -cat <path>
6. copyToLocal (or get): Copies files/folders from the HDFS store to the local file
system.
bin/hdfs dfs -copyToLocal <srcfile(on hdfs)> <local file dest>
7. moveFromLocal: Moves a file from the local file system to HDFS.
bin/hdfs dfs -moveFromLocal <local src> <dest(on hdfs)>
8. cp: Copies files within HDFS.
bin/hdfs dfs -cp <src(on hdfs)> <dest(on hdfs)>
9. mv: Moves files within HDFS.
bin/hdfs dfs -mv <src(on hdfs)> <dest(on hdfs)>
Remember, before using these commands, you need to start the Hadoop services using the following command:
sbin/start-all.sh
To check whether the Hadoop services are up and running, use:
jps
These commands are executed in the terminal or command prompt
of the system where Hadoop is installed.
MapReduce
• MapReduce is a programming model and an associated implementation for processing and
generating big data sets with a parallel, distributed algorithm on a cluster. It's a core component of
the Hadoop framework.
• A MapReduce program is composed of a Map procedure, which performs filtering and sorting, and
a Reduce method, which performs a summary operation. The "MapReduce System" orchestrates
the processing by marshalling the distributed servers, running the various tasks in parallel,
managing all communications and data transfers between the various parts of the system, and
providing for redundancy and fault tolerance.
Here's a brief overview of how MapReduce works:
1. Map Phase: The Map function takes an input pair and produces a set of intermediate key/value
pairs. The Map function is applied to every input data element.
2. Shuffle and Sort Phase: The MapReduce framework groups intermediate values based on their
intermediate keys using a process called Shuffle and Sort.
3. Reduce Phase: In the Reduce phase, the reduce function is applied for each unique key in
sorted order. The Reduce function takes an intermediate key and a set of values for that key. It
merges these values to form a possibly smaller set of values.
• The key contributions of the MapReduce framework are not the actual map and reduce functions,
but the scalability and fault-tolerance achieved for a variety of applications due to parallelization
• The use of this model is beneficial only when the optimized distributed shuffle operation (which
reduces network communication cost) and fault tolerance features of the MapReduce framework
come into play
• MapReduce libraries have been written in many programming languages, with different levels of
optimization. A popular open-source implementation that has support for distributed shuffles is
part of Apache Hadoop
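• To make the three phases concrete, here is a small, self-contained Python simulation of word count. It only illustrates the Map, Shuffle and Sort, and Reduce steps in memory; a real job would run on a Hadoop cluster (for example via Hadoop Streaming), and the documents below are made-up input.

# A self-contained simulation of the MapReduce phases for word count (illustrative only).
from itertools import groupby
from operator import itemgetter

documents = ["big data needs big tools", "data tools for big data"]

# 1. Map phase: emit (word, 1) pairs for every word of every input record.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# 2. Shuffle and Sort phase: group the intermediate pairs by key.
mapped.sort(key=itemgetter(0))

# 3. Reduce phase: sum the values for each unique key.
counts = {word: sum(v for _, v in group)
          for word, group in groupby(mapped, key=itemgetter(0))}
print(counts)   # e.g. {'big': 3, 'data': 3, 'for': 1, 'needs': 1, 'tools': 2}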
Hive
• Hive is a data warehouse infrastructure tool that resides on top of Hadoop to process structured
data. It was developed by Facebook and is now used by other companies like Amazon and Netflix.
Hive provides an SQL-like interface between the user and the Hadoop distributed file system (HDFS),
making it easier to query and analyze large datasets stored in Hadoop's distributed file system.
Here are some key features of Hive:
1. Hive Query Language (HiveQL): Hive uses a language called HiveQL, which is similar to SQL. This
allows users to express data queries, transformations, and analyses in a familiar syntax.
2. Data Warehousing: Hive is frequently used for data warehousing tasks like data encapsulation, ad-
hoc queries, and analysis of large datasets.
3. Compatibility with Hadoop: Hive is built on top of Hadoop and integrates well with the Hadoop
ecosystem. It can work with data stored in HDFS or other compatible storage systems.
4. Scalability and Performance: Hive is designed to enhance scalability, extensibility, performance,
and fault tolerance.
5. Support for Various Data Formats: Hive supports data stored in a variety of formats, including Text
Files, Sequence Files, RCFiles, Avro Data Files, ORC Files, and Parquet Files.
6. Components of Hive: Hive includes components like HCatalog, a table and storage management
layer for Hadoop, and WebHCat, a service that provides an HTTP interface for Hadoop MapReduce,
Pig, Hive tasks, or Hive metadata operations.
Please note that Hive is not built for Online Transactional Processing (OLTP) workloads. It is more
suited for batch processing rather than interactive use. The emphasis is on high throughput of data
access rather than low latency of data access.
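• For illustration, a small HiveQL example is sketched below, held in a Python string so the module's examples stay in one language. The table, columns, and file paths are hypothetical; one common way to run such a script is through the beeline client.

# Hypothetical HiveQL script written out from Python; table names and paths are placeholders.
hiveql = """
CREATE TABLE IF NOT EXISTS sales (
    customer_id STRING,
    amount      DOUBLE,
    city        STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

LOAD DATA INPATH '/data/sales.csv' INTO TABLE sales;

SELECT city, SUM(amount) AS total_sales
FROM sales
GROUP BY city;
"""

with open("sales_report.hql", "w") as f:
    f.write(hiveql)

# The script could then be run with a Hive client, for example:
#   beeline -u jdbc:hive2://<hive-server>:10000 -f sales_report.hql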
Pig and Pig Latin
• Apache Pig is a high-level platform used to process large datasets. It provides a high level of
abstraction for processing over MapReduce. The main components of Apache Pig are the Pig Latin
scripting language and the Pig Engine.
• Pig Latin is a data flow language used by Apache Pig to analyze data in Hadoop. It abstracts the
programming from the Java MapReduce idiom into a higher-level notation. Pig Latin statements are used to
process the data. Each statement must end with a semicolon. Statements may include expressions and
schemas. By default, these statements are processed using multi-query execution.
Here are some key features of Apache Pig:

1. Ease of programming: Pig Latin is easy to learn for programmers who are familiar with scripting
languages and SQL.
2. Optimization opportunities: The system automatically optimizes the execution of Pig Latin scripts,
allowing the programmer to focus on semantics rather than efficiency.
3. Extensibility: Users can create their own functions to do special-purpose processing.
4. Handling of various data types: Pig Latin can handle various data types including tuples, bags, and
maps, which are not natively supported in MapReduce.
5. Multi-query approach: Apache Pig reduces the length of code by using a multi-query approach,
thereby reducing development time.

• Please note that while Pig Latin is used in the context of Apache Pig
and Hadoop, it is unrelated to the playful language game also known
as Pig Latin
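• A short, hypothetical Pig Latin script is sketched below, embedded in a Python string purely so the module's examples stay in one language. The input path and field names are placeholders; note that each Pig Latin statement ends with a semicolon, as described above.

# Hypothetical Pig Latin script; the input path and field names are placeholders.
pig_latin = """
orders = LOAD '/data/orders.csv' USING PigStorage(',')
         AS (customer_id:chararray, amount:double, city:chararray);
big_orders = FILTER orders BY amount > 100.0;
by_city = GROUP big_orders BY city;
totals = FOREACH by_city GENERATE group AS city, SUM(big_orders.amount) AS total;
DUMP totals;
"""

with open("orders.pig", "w") as f:
    f.write(pig_latin)

# The script could be tried out in local mode with:  pig -x local orders.pig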
Sqoop
• Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and
structured datastores such as relational databases. It's a key component in the Hadoop ecosystem
for moving data from non-Hadoop data stores – such as relational databases and data warehouses –
into Hadoop.
• Here are some key features of Sqoop:
1. Efficient Data Transfer: Sqoop uses a connector-based architecture which allows it to transfer data
between any relational database management system and Hadoop quickly and efficiently.
2. Import/Export Operations: Sqoop can import data from a relational database into HDFS, Hive or
HBase for further processing. It can also export the results back into the relational database.
3. Connectors for All Major RDBMS Databases: Sqoop includes connectors for multiple major
RDBMS databases, including MySQL, PostgreSQL, Oracle, SQL Server, and DB2.
4. Parallel Import/Export: Sqoop uses MapReduce to import and export the data, which provides
parallel operation as well as fault tolerance.
5. Incremental Load: Sqoop also supports incremental loads, which allows you to load only the new
or modified rows from the relational database into Hadoop.
6. Hive Integration: Sqoop can also import the data directly into Hive by generating and executing a
CREATE TABLE statement to define the data's layout in Hive.
• Sqoop successfully graduated from the Incubator in March of 2012 and is now a Top-Level Apache
project. The latest stable release is 1.4.7.
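• An illustrative Sqoop import command is sketched below, built as a Python argument list so the module's examples share one language. The connection string, credentials, table, and target directory are placeholders, not a real configuration.

# Hypothetical Sqoop import: copy a relational table into HDFS.
import subprocess

sqoop_import = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://db-host:3306/sales_db",   # placeholder JDBC URL
    "--username", "analyst",
    "--password", "secret",          # in practice prefer --password-file
    "--table", "orders",
    "--target-dir", "/user/hadoop/orders",
    "--num-mappers", "4",            # parallel import via 4 map tasks
]

# subprocess.run(sqoop_import, check=True)   # uncomment on a machine with Sqoop installed
print(" ".join(sqoop_import))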
Zookeeper
• Apache Zookeeper is an open-source server that enables highly reliable distributed coordination. It's
a centralized service for maintaining configuration information, naming, providing distributed
synchronization, and providing group services. These services are used in some form or another by
distributed applications.
• Zookeeper provides a way to ensure that nodes in a distributed system are aware of each other and
can coordinate their actions. It does this by maintaining a hierarchical tree of data nodes called
"znodes", which can be used to store and retrieve data and maintain state information.
• ZooKeeper provides a set of primitives, such as locks, barriers, and queues, that can be used to
coordinate the actions of nodes in a distributed system. It also provides features such as leader
election, failover, and recovery, which can help ensure that the system is resilient to failures.
• ZooKeeper is widely used in distributed systems such as Hadoop, Kafka, and HBase, and it has
become an essential component of many distributed applications
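• As a sketch of how an application might work with znodes, the following uses the kazoo Python client. It assumes kazoo is installed and a ZooKeeper server is running locally; the znode path and value are made up for illustration.

# Minimal ZooKeeper sketch using the kazoo client (assumes `pip install kazoo`
# and a ZooKeeper server on localhost:2181; the znode path/value are examples).
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Create a znode to hold a small piece of configuration/state.
zk.ensure_path("/app/config")
if not zk.exists("/app/config/feature_flag"):
    zk.create("/app/config/feature_flag", b"enabled")

# Read the data and its metadata (version, timestamps) back.
data, stat = zk.get("/app/config/feature_flag")
print(data.decode(), stat.version)

zk.stop()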
Flume
• Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and
moving large amounts of log data. It has a simple and flexible architecture based on streaming data
flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery
mechanisms. It uses a simple extensible data model that allows for online analytic applications.
• Flume is designed to move the log data generated by application servers into HDFS at a higher speed.
It is used to import huge volumes of event data produced by social networking sites like Facebook and
Twitter, and e-commerce websites like Amazon and Flipkart. Flume supports a large set of source and
destination types.
• Flume provides a steady flow of data between data producers and the centralized stores when the rate
of incoming data exceeds the rate at which data can be written to the destination. It also provides the
feature of contextual routing. The transactions in Flume are channel-based, where two transactions
(one sender and one receiver) are maintained for each message. It guarantees reliable message
delivery.
• Flume can be scaled horizontally. It is highly configurable and customizable. It is used in conjunction
with Hadoop to create applications, load data from various sources like Twitter, and stream it to HDFS.
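• A minimal Flume agent configuration is sketched below as a Python string (kept in Python so the module's examples share one language). The agent name, source command, and HDFS path are placeholders; the point is to show the source -> channel -> sink wiring the text above describes.

# Hypothetical Flume agent configuration; names and paths are placeholders.
flume_conf = """
# agent a1: tail an application log file and stream the events into HDFS
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

a1.sources.r1.type    = exec
a1.sources.r1.command = tail -F /var/log/app/app.log

a1.channels.c1.type     = memory
a1.channels.c1.capacity = 10000

a1.sinks.k1.type      = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/app-logs/

a1.sources.r1.channels = c1
a1.sinks.k1.channel    = c1
"""

with open("app-logs.conf", "w") as f:
    f.write(flume_conf)

# The agent could then be started with something like:
#   flume-ng agent --name a1 --conf-file app-logs.conf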
Oozie
• Apache Oozie is a workflow scheduler system for managing Apache Hadoop jobs. Here are some
key points about Oozie:
• Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of actions.
• Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability.
• It supports several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig,
Hive, Sqoop and Distcp) as well as system-specific jobs (such as Java programs and shell scripts).
• It consists of two parts: a Workflow engine and a Coordinator engine.
• The Workflow engine is responsible for storing and running workflows composed of Hadoop jobs, e.g.,
MapReduce, Pig, Hive.
• The Coordinator engine runs workflow jobs based on predefined schedules and availability of data.
• Oozie is scalable, reliable, extensible, and flexible. It can manage the timely execution of thousands of
workflows (each consisting of dozens of jobs) in a Hadoop cluster.
• It makes it very easy to rerun failed workflows.
