
UNIT I

Big Data
Big Data refers to extremely large and complex datasets that traditional data processing
tools cannot handle efficiently. These datasets come from sources such as social media,
sensors, IoT devices, and e-commerce platforms, and they grow exponentially in size over
time. Big Data enables organizations to extract valuable insights for decision-making,
improving customer experiences, and optimizing operations.

Big Data is typically characterized by the following "5 Vs":


Volume: The size of the data, which is often measured in terabytes or petabytes. Example:
Social media platforms like Facebook generate terabytes of data daily.
Velocity: Describes the speed at which data is created, processed, and analyzed. Example:
Real-time data streams from sensors in IoT devices or financial transactions.
Variety: The different types of data formats, including structured (databases), semi-
structured (XML, JSON), and unstructured (videos, images, texts).
Veracity: Relates to the trustworthiness and accuracy of the data. Big Data often contains
noise, inconsistencies, or errors, which can affect analysis if not handled properly.
Value: The ultimate goal of Big Data is to derive value or actionable insights. Raw data itself
isn't useful unless processed and analyzed for meaningful applications, such as predicting
trends, improving customer experience, or optimizing operations.

Examples
Discovering consumer shopping habits.
Finding new customer leads.
Monitoring health conditions through data from wearables.
Predictive inventory ordering.

Why is Big Data Important? (Significance of Big Data)


Saves cost
Saves time
Helps gain a better grasp of market conditions
Improves a company's online presence
Boosts customer acquisition and retention
Solves advertisers' problems and offers marketing insights
Drives innovation and product development

Challenges with Big Data


Data privacy: The Big Data we now generate contains a lot of information about our
personal lives, much of which we have a right to keep private.
Data security: Even if we decide we are happy for someone to have our data for a
purpose, can we trust them to keep it safe?
Data discrimination: When everything is known, will it become acceptable to discriminate
against people based on data we have on their lives? We already use credit scoring to
decide who can borrow money, and insurance is heavily data-driven.
Data quality: Quality and contextual relevance often receive too little emphasis. The trend
in technology is to collect more raw data closer to the end user, but raw data typically
contains quality issues. Shrinking the distance between the end user and the raw data
therefore increases the risk that poor-quality data reaches analysis.

Types of Digital Data


Structured Data
• Definition: Organized data that is stored in predefined formats, such as rows and
columns in a relational database.
• Characteristics:
• Easily searchable using query languages (e.g., SQL).
• Examples:
• Transaction records in databases.
• CSV files

Semi-Structured Data
• Definition: Data that does not have a strict structure but uses tags or markers to
separate elements and enforce hierarchies.
• Characteristics:
• Does not fit neatly into relational databases but has some organizational
properties.
• Often stored in formats that allow flexibility and scalability.
• Examples:
• JSON and XML files.
• NoSQL database records.
• Email metadata (e.g., sender, recipient, timestamp).
Quasi-Structured Data
Quasi-structured data refers to data that does not have a rigid structure like structured
data but still contains some organizational elements or markers that make it easier to parse
and analyze. It is often considered a subset of semi-structured data.
Characteristics:
1. Lacks a Fixed Schema:
• No strict organization into tables or rows but has some identifiable patterns
or tags.
2. Requires Parsing:
• Tools or algorithms are often needed to extract meaningful information.

3. Flexible Formats:
• The structure can vary within the same dataset, making it dynamic.
4. Intermediate Complexity:
• More structured than unstructured data but less structured than fully
structured data.
Examples:
• Sensor Logs
• Social Media Comments
• Email Logs

Unstructured Data
• Definition: Data that lacks a predefined format or organization, making it difficult to
store and process using traditional tools.
• Characteristics:
• Cannot be stored in rows and columns.
• Requires advanced tools (e.g., Hadoop, Spark) for processing and analysis.
• Examples:
• Text files, social media posts, and customer reviews.
• Multimedia content like images, videos, and audio files.
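
To make these categories concrete, here is a small, hedged Python sketch (the sample records are
made up for illustration) showing how structured CSV rows, a semi-structured JSON document, and
unstructured free text are typically handled.

import csv
import io
import json

# Structured data: fixed columns, so every record can be read the same way.
csv_text = "id,name,amount\n1,Alice,250\n2,Bob,120\n"
for row in csv.DictReader(io.StringIO(csv_text)):
    print(row["name"], row["amount"])

# Semi-structured data: keys/tags give some organization, but fields may vary per record.
json_text = '{"id": 1, "name": "Alice", "tags": ["vip"], "address": {"city": "Pune"}}'
record = json.loads(json_text)
print(record["name"], record["address"]["city"])

# Unstructured data: free text has no schema; it needs extra processing before analysis.
review = "Great product, fast delivery, would buy again!"
print(len(review.split()), "words in the review")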

Hadoop
Hadoop is an open-source framework developed by the Apache Software Foundation. It is
designed to store, process, and analyze large-scale datasets efficiently. Hadoop uses a
distributed computing model, enabling it to handle Big Data across clusters of commodity
hardware.

Core Components of Hadoop


1. HDFS (Hadoop Distributed File System)
• A distributed file storage system that splits large files into smaller blocks and
stores them across a cluster of machines.
• Ensures data redundancy by replicating blocks across multiple nodes for fault
tolerance.
2. MapReduce
• A programming model for processing large datasets in parallel across a
Hadoop cluster.
• Consists of two main steps:
• Map: Processes input data and converts it into key-value pairs.
• Reduce: Aggregates the output of the Map step to generate
meaningful results.

3. YARN (Yet Another Resource Negotiator)


• A resource management layer that schedules and allocates resources to
various applications running on a Hadoop cluster.
4. Hadoop Common
• A set of utilities and libraries that support the other Hadoop modules.

Benefits of Hadoop
• Scalability: Easily scales from a single node to thousands of nodes.
• Fault Tolerance: Automatically replicates data to ensure availability even in case of
node failures.
• Cost-Effectiveness: Runs on inexpensive commodity hardware.
• Flexibility: Handles structured, semi-structured, and unstructured data.

Applications of Hadoop
• Fraud detection in financial services
• Customer sentiment analysis in marketing
• Recommendation systems in e-commerce
• Log analysis in IT and cybersecurity
• Genomic data analysis in healthcare
HDFS is not a good fit in the following scenarios:
• Low-latency data access
• Lots of small files
• Multiple writers, arbitrary file modifications

Apache Hadoop
Apache Hadoop is an open-source framework designed for distributed storage and
processing of large datasets across clusters of computers. It is particularly suited for
handling Big Data and is developed and maintained by the Apache Software Foundation.

Workflow Example: Word Count in Hadoop


Let’s consider a simple example where Hadoop is used to count the frequency of words in a
text file:
1. File Storage:
• The input text file is stored in HDFS and split into blocks.
2. Map Phase:
• Each block is processed by a Mapper that emits key-value pairs (e.g., ("word",
1)).
3. Shuffling:
• All identical keys (e.g., "word") from different Mappers are grouped together.
4. Reduce Phase:
• The Reducer sums up the values for each key (e.g., ("word", 5) if "word"
appears 5 times).
5. Output:
• The final word counts are written back to HDFS.
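
The flow of key-value pairs through these phases can be sketched in plain Python, run locally
rather than on a cluster; this is only an illustration of the MapReduce model, not Hadoop's actual
implementation.

from collections import defaultdict

lines = ["big data tools", "big data big insights"]  # stands in for an HDFS input split

# Map phase: each line is turned into ("word", 1) pairs.
mapped = []
for line in lines:
    for word in line.split():
        mapped.append((word, 1))

# Shuffle phase: pairs with the same key are grouped together.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: the counts for each key are summed to give the word frequencies.
for word, counts in grouped.items():
    print(word, sum(counts))  # e.g., ("big", 3), ("data", 2)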

Hadoop Streaming
Hadoop Streaming is a utility that allows developers to write MapReduce programs in any
programming language that can read from standard input (stdin) and write to standard
output (stdout). This means you can use languages like Python, Ruby or any other language
instead of Java to implement MapReduce logic in a Hadoop environment.

How Hadoop Streaming Works


Hadoop Streaming works by wrapping the MapReduce framework around your existing
code. The Hadoop framework handles the task of distributing and executing jobs across the
cluster, while the user’s script handles the map and reduce logic.
• Mapper: Reads input from standard input (stdin), processes it, and writes the output
(key-value pairs) to standard output (stdout).
• Reducer: Reads the output from the mappers, processes the key-value pairs, and
produces the final output.

Basic Workflow
1. Input: Data is passed to the Mapper via stdin (standard input).
2. Mapper: The script processes each line and outputs key-value pairs.
3. Shuffle and Sort: Hadoop sorts and groups the intermediate key-value pairs from
the mappers.
4. Reducer: The script processes grouped key-value pairs and outputs the final results
to stdout (standard output).
5. Output: The final output is written to HDFS or any other output location.
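
A minimal word-count job for Hadoop Streaming might use the two Python scripts sketched below
(file names such as mapper.py and reducer.py are illustrative). The job would then be submitted
with the hadoop-streaming jar, passing the -input, -output, -mapper, and -reducer options; the
exact jar path depends on the installation.

# mapper.py: reads raw lines from stdin and emits "word<TAB>1" pairs on stdout.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py: Hadoop delivers the pairs sorted by key, so each word's counts arrive together.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")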

Advantages of Hadoop Streaming


1. Language Flexibility: You can use any language that supports standard input and
output, such as Python, Ruby, or even C++.
2. Quick Prototyping: You can quickly implement MapReduce logic without needing to
work in Java, which is often used in the native Hadoop framework.
3. Integration with Existing Code: It allows the use of existing code written in
languages like Python or Ruby, which may be more suitable for certain tasks.
Hadoop Ecosystem
The Hadoop Ecosystem consists of a suite of tools and frameworks built around Apache
Hadoop that support various functionalities, from data storage to processing and analysis.
These components help manage, store, process, and analyze large datasets, enabling
organizations to leverage big data effectively. Here’s an overview of the primary
components of the Hadoop ecosystem:

1. Hadoop Distributed File System (HDFS)


• HDFS is the storage layer of Hadoop. It stores large volumes of data across a
distributed cluster of machines.
• Data Blocks: Files are divided into blocks (typically 128MB or 256MB) for
storage across different nodes.
• Replication: Each block is replicated across different nodes (usually three
copies) to ensure fault tolerance.

2. MapReduce
• MapReduce is the programming model used for processing large datasets in parallel.
• Map Phase: Input data is processed in parallel by mappers, outputting key-
value pairs.
• Reduce Phase: The key-value pairs are aggregated by reducers to produce the
final output.
• This framework is used for batch processing and handles tasks like sorting, counting,
and filtering data.

Other Core Components of the Hadoop Ecosystem

3. YARN (Yet Another Resource Negotiator)


• YARN is the resource management layer of Hadoop that manages resources in the
cluster and schedules jobs.
• Resource Manager (RM): Manages resource allocation and job scheduling.
• Node Manager (NM): Runs on each node and monitors resource usage and
job status.

4. Hadoop Common
• Hadoop Common refers to the shared utilities, libraries, and APIs needed by other
Hadoop modules. It includes components that are essential for HDFS and
MapReduce to function.

Additional Tools and Frameworks in the Hadoop Ecosystem

5. Apache Hive
• Hive is a data warehouse system that provides a SQL-like interface for querying and
managing data stored in HDFS. It abstracts the complexities of MapReduce
programming by allowing users to write queries in HiveQL, a SQL-like language.
• Use Case: Useful for business intelligence (BI) and data analytics.

6. Apache HBase
• HBase is a NoSQL database that provides real-time, random access to large datasets.
• Built on top of HDFS, HBase is designed to handle very large datasets that
require quick read and write operations.
• Use Case: Real-time analytics and operational workloads requiring low-
latency data access.

7. Apache Pig
• Pig is a high-level platform for processing data using a language called Pig Latin. It
abstracts the complexities of writing MapReduce jobs, providing a simpler way to
express data transformations.
• Use Case: ETL (Extract, Transform, Load) processes and data transformation tasks.

8. Apache Spark
• Spark is an in-memory processing engine that provides faster data processing than
MapReduce. It can be used for both batch and stream processing, and it supports
complex analytics, machine learning, and graph processing.
• RDDs (Resilient Distributed Datasets): Core abstraction that enables
distributed data processing.
• Use Case: Faster processing and real-time analytics.

9. Apache Flume
• Flume is a distributed tool for collecting, aggregating, and moving large amounts of
log data from various sources to Hadoop (HDFS).
• Use Case: Streaming data ingestion, particularly for log data.

10. Apache Sqoop


• Sqoop is a tool for transferring bulk data between Hadoop and relational databases
(e.g., MySQL, Oracle).
• Use Case: Importing and exporting data to and from HDFS and relational
databases.

11. Apache Oozie


• Oozie is a workflow scheduler system that helps manage and schedule Hadoop jobs.
It allows the orchestration of data pipelines, combining different types of jobs
(MapReduce, Hive, Pig, etc.) into a unified workflow.
• Use Case: Automating complex workflows for data processing.

12. Apache Zookeeper


• Zookeeper is a coordination service for managing and synchronizing distributed
applications. It ensures that distributed systems can work together by maintaining
configuration information, synchronization, and naming.
• Use Case: Used by HBase, Kafka, and other components to coordinate distributed
services.

13. Apache Kafka


• Kafka is a distributed event streaming platform that can handle real-time data feeds.
It is often used to ingest and process large streams of data in real-time.
• Use Case: Real-time data streaming and integration with other Hadoop ecosystem
components.

Tools for Data Governance and Security

14. Apache Ranger


• Ranger is a framework for managing and enforcing security policies across Hadoop
components. It provides centralized security administration and auditing.
• Use Case: Fine-grained access control and security auditing.

15. Apache Atlas


• Atlas is a governance and metadata framework for managing data in the Hadoop
ecosystem. It helps in managing metadata, auditing, and ensuring compliance with
regulatory standards.
• Use Case: Data governance, lineage tracking, and metadata management.

Use Cases of the Hadoop Ecosystem


• Data Warehousing: Using Hive for SQL-like queries and Pig for ETL tasks to analyze
structured and unstructured data stored in HDFS.
• Real-Time Analytics: Using Apache Spark or HBase to analyze streaming data or
perform low-latency queries on large datasets.
• Data Ingestion: Using Apache Flume to ingest large volumes of log or sensor data
into HDFS for processing and analysis.
• Data Integration: Using Apache Sqoop to import data from relational databases into
Hadoop for analysis.

IBM Big Data Strategy

IBM has developed a comprehensive strategy for Big Data, focusing on enabling businesses
to harness the value of their data through advanced analytics, artificial intelligence (AI),
and cloud computing. IBM’s Big Data strategy emphasizes the integration of various
technologies to help businesses manage, analyze, and gain insights from massive volumes
of data.
Core Components of IBM’s Big Data Strategy
1. IBM Cloud
• IBM Cloud provides businesses with the tools and infrastructure to manage, store,
and process Big Data in a highly scalable and flexible cloud environment. It includes
both public and private cloud services, as well as hybrid cloud solutions.
• IBM Cloud Pak for Data is an integrated data and AI platform that allows businesses
to collect, organize, and analyze data from various sources, while providing tools for
governance and security.

2. Data Storage and Management


• IBM Db2: A relational database that supports high-performance transaction
processing and analytics.
• IBM Cloud Object Storage: A highly scalable, secure storage platform that allows
businesses to store large amounts of unstructured data.
• IBM DataStage: An ETL (Extract, Transform, Load) tool that helps organizations
prepare data for analysis by integrating it from different sources.

3. Analytics and AI
• IBM Watson: A suite of AI-powered analytics tools that help businesses gain insights
from data. Watson includes solutions like:
• Watson Studio: A platform for data scientists and developers to build and
train machine learning models.
• Watson Machine Learning: Provides tools for automating model deployment
and management.
• Watson Knowledge Catalog: An AI-driven data governance and cataloging
solution.
• IBM SPSS: A powerful tool for advanced statistical analysis, used for predictive
analytics and decision-making.
• IBM Cognos Analytics: A suite for data exploration, reporting, and visualization,
empowering businesses to make data-driven decisions.

4. Data Integration and Transformation


• IBM InfoSphere: A suite of data integration and transformation tools designed to
help organizations manage and manipulate data from various sources. It includes
tools like IBM InfoSphere DataStage and IBM InfoSphere Master Data
Management.
• IBM Data Virtualization: A technology that allows businesses to integrate disparate
data sources without physically moving the data, simplifying analytics and reducing
costs.

5. Real-Time Analytics
• IBM Streams: A real-time data analytics platform for analyzing high-velocity data
streams, enabling businesses to process data as it arrives and take immediate
actions.
6. Data Governance and Security
• IBM Watson Knowledge Catalog: Helps businesses organize, govern, and manage
their data assets, ensuring compliance with industry regulations.
• IBM Guardium: A data security and protection tool that monitors and protects
sensitive data.
• IBM Cloud Pak for Security: Offers solutions for securing data and workloads across
hybrid and multicloud environments.

7. Collaboration and Ecosystem


• IBM’s open-source approach: By embracing open-source technologies like Apache
Hadoop, Spark, and Kubernetes, IBM has made it easier for businesses to adopt
cutting-edge Big Data tools.

IBM InfoSphere BigInsights


IBM InfoSphere BigInsights is a comprehensive Big Data analytics platform that helps
organizations process and analyze large volumes of unstructured and structured data. It is
built on top of Apache Hadoop and enhances Hadoop's capabilities by providing advanced
analytics, integration with other IBM tools, and enterprise-level features, making it easier
to manage and analyze Big Data.

Key Features of IBM InfoSphere BigInsights


1. Built on Hadoop:
• InfoSphere BigInsights is built on top of the Apache Hadoop framework,
offering the scalability and fault tolerance of Hadoop’s distributed computing
model.
• It integrates with Hadoop’s storage (HDFS) and processing engines
(MapReduce, YARN).
2. Advanced Analytics:
• It supports complex analytics using Apache Spark, Hadoop MapReduce, and
SQL-like query processing via Hive and Big SQL.
• It provides tools for data mining, machine learning, and natural language
processing, enabling advanced analysis on large datasets.
3. Big SQL:
• Big SQL is IBM's high-performance SQL engine for querying data stored in
Hadoop. It allows users to run SQL queries across both structured and
unstructured data in Hadoop.
• It integrates with other IBM products like IBM Db2 and IBM DataStage for
seamless data management and analysis.
4. Data Integration and ETL:
• InfoSphere BigInsights integrates with IBM InfoSphere DataStage, which
allows for seamless data integration, transformation, and loading (ETL).
• It supports various data formats, making it easy to integrate data from
different sources (e.g., relational databases, flat files, cloud storage, etc.).
5. Data Security and Governance:
• It includes built-in security features such as encryption, authentication, and
access control to protect sensitive data.
• The InfoSphere Information Governance Catalog helps businesses maintain
proper data governance and metadata management.
6. Real-Time Analytics:
• InfoSphere BigInsights includes real-time data processing capabilities through
integration with tools like Apache Kafka and Apache Storm.
• It supports streaming analytics, enabling businesses to process and analyze
data in real time as it is ingested into the system.
7. User-Friendly Interfaces:
• The platform provides intuitive graphical interfaces and dashboards for easy
interaction with data, reducing the need for complex coding and manual
interventions.
• IBM InfoSphere BigInsights includes tools like InfoSphere BigInsights Query
Workbench, which simplifies the process of querying and managing large
datasets.
8. Scalability and Performance:
• Built to scale horizontally, InfoSphere BigInsights can handle increasing data
volumes by adding more nodes to the cluster.
• It optimizes performance through in-memory caching, parallel processing,
and workload management, ensuring efficient resource use and faster
insights.

IBM BigSheets
IBM BigSheets is a tool designed to help businesses analyze large datasets through a
familiar, browser-based, spreadsheet-like interface. It is part of the IBM InfoSphere
BigInsights platform and leverages Hadoop's power to process and analyze Big Data
without the need for specialized programming skills.

Key Features of IBM BigSheets:


1. Spreadsheet Interface:
• Provides a user-friendly, spreadsheet-like interface for Big Data analysis,
making it accessible to non-technical users, such as business analysts.
• Users can work with large datasets directly within the interface without
requiring knowledge of complex coding or database queries.
2. Big Data Support:
• Integrates with Hadoop and can handle large-scale datasets (terabytes or
more) by leveraging the distributed power of Hadoop clusters for processing.
• It allows users to work with data stored in HDFS or in IBM Cloud Object
Storage.
3. Data Integration:
• BigSheets can pull in data from various sources, including Hadoop, relational
databases, and external data feeds, making it versatile for diverse data
analytics needs.
4. Advanced Analytics:
• Users can apply built-in statistical and analytical functions to the data, such as
aggregation, filtering, and transformations, similar to features found in Excel.
• Supports MapReduce and Apache Spark for processing large datasets,
allowing users to scale their analyses as needed.
5. Collaboration:
• BigSheets supports sharing and collaborating on datasets, making it easier for
teams to work together on Big Data projects.
• It allows integration with other IBM tools such as Watson Analytics and
Cognos Analytics for more advanced visualizations and reporting.
6. Visualization:
• Built-in data visualization tools allow users to create charts, graphs, and other
visual representations of their data to help identify trends and patterns.
7. SQL-like Querying:
• Users can run SQL-like queries on Big Data using the BigSheets Query Editor,
providing a familiar environment for data analysis without needing to write
complex code.
UNIT II

HDFS Federation
HDFS Federation is a feature introduced in Hadoop 2.x to enhance the scalability and
performance of the Hadoop Distributed File System (HDFS). It addresses the limitations of
the single NameNode architecture in earlier Hadoop versions, where a single NameNode
could become a bottleneck as the number of files and the size of the cluster grew.

Key Components of HDFS Federation


1. Multiple NameNodes:
• In HDFS Federation, multiple independent NameNodes are deployed in the
cluster.
• Each NameNode manages a portion of the filesystem namespace, referred to
as a namespace volume.
2. Shared DataNodes:
• All DataNodes in the cluster store blocks and serve data for all namespaces.
• DataNodes are shared across all the NameNodes, allowing seamless data
storage and access.
3. Block Pool:
• Each namespace has its own block pool, a collection of blocks managed by a
specific NameNode.
• A block pool is independent, meaning one NameNode's failure does not affect
other namespaces or block pools.

How HDFS Federation Works


1. Namespace Isolation:
• The filesystem namespace is divided into multiple independent namespaces.
• Each NameNode manages its assigned namespace without knowledge of the
others.
2. DataNode Sharing:
• DataNodes do not differentiate between namespaces. They store blocks from
all namespaces and report block information to all associated NameNodes.
3. Clients Access:
• Clients access data through the appropriate NameNode based on the
namespace they need to interact with.
• The file paths in HDFS determine which NameNode to contact.

Benefits of HDFS Federation


1. Scalability:
• By splitting the namespace across multiple NameNodes, the system can
handle a much larger number of files and directories.
2. Performance:
• Workloads are distributed among multiple NameNodes, reducing bottlenecks
and improving performance for large-scale deployments.
3. Fault Isolation:
• If one NameNode fails, only the associated namespace is affected. Other
namespaces managed by different NameNodes remain operational.
4. Flexibility:
• Organizations can allocate specific namespaces for different applications or
departments, providing isolation and better resource management.
5. Simplified Management:
• With independent namespaces, administrators can better organize data and
manage cluster resources.

Limitations of HDFS Federation


1. No Namespace Sharing:
• Namespaces are independent, and data cannot be shared directly between
them. Applications need to handle cross-namespace access.
2. Complexity in Deployment:
• Managing multiple NameNodes requires careful planning and additional
configuration compared to a single NameNode setup.
3. Resource Overhead:
• Each NameNode requires memory and processing resources, which can
increase the resource footprint of the cluster.

Data Replication
• HDFS is designed to reliably store very large files across machines in a large cluster. It
stores each file as a sequence of blocks; all blocks in a file except the last block are
the same size. The blocks of a file are replicated for fault tolerance. The block size
and replication factor are configurable per file. An application can specify the
number of replicas of a file. The replication factor can be specified at file creation
time and can be changed later. Files in HDFS are write-once and have strictly one
writer at any time.

• The NameNode makes all decisions regarding replication of blocks. It periodically


receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster.
Receipt of a Heartbeat implies that the DataNode is functioning properly. A
Blockreport contains a list of all blocks on a DataNode.

• Large HDFS instances run on a cluster of computers that commonly spans many
racks. For the common case, when the replication factor is three, HDFS’s
placement policy is to put one replica on one node in the local rack, another on a
node in a different (remote) rack, and the last on a different node in the same
remote rack.
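
As a hedged illustration of the per-file replication factor, the sketch below uses the third-party
Python HdfsCLI package (assumed to be installed; the NameNode address, user, and paths are
placeholders). The same effect can be achieved with the hdfs dfs -setrep shell command.

from hdfs import InsecureClient

# Placeholder NameNode web address and user; adjust for the actual cluster.
client = InsecureClient("http://namenode-host:9870", user="hadoop")

# Write a small file with a replication factor of 3 (a common default).
client.write("/data/example.txt", data=b"hello hdfs\n", overwrite=True, replication=3)

# Read it back to confirm the write succeeded.
with client.read("/data/example.txt") as reader:
    print(reader.read())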

Data Disk Failure, Heartbeats and Re-Replication: Each DataNode sends a


Heartbeat message to the NameNode periodically. A network partition can cause a
subset of DataNodes to lose connectivity with the NameNode. The NameNode
detects this condition by the absence of a Heartbeat message. The NameNode
marks DataNodes without recent Heartbeats as dead and does not forward any
new IO requests to them. Any data that was registered to a dead DataNode is not
available to HDFS any more. The NameNode constantly tracks which blocks need to
be replicated and initiates replication whenever necessary. The necessity for re-
replication may arise due to many reasons: a DataNode may become unavailable, a
replica may become corrupted, a hard disk on a DataNode may fail, or the
replication factor of a file may be increased.

Flume
Apache Flume is a robust, distributed system designed for ingesting large volumes of
streaming data into systems like HDFS, Hive, or HBase. It is widely used for collecting,
aggregating, and transferring log data or other event-based data from multiple sources to
Big Data storage solutions.

Key Components of Flume for Data Ingestion:


1. Source:
• Captures incoming data from various data producers, such as log files, HTTP
requests, or social media feeds.
2. Channel:
• Temporarily buffers the data received from the source before delivering it to
the sink.
• Types:
• Memory Channel: High throughput but may lose data on failure.
• File Channel: Reliable, stores data persistently.
3. Sink:
• Sends the buffered data from the channel to the desired destination, such as
HDFS, Hive, HBase, or Kafka.
• Common sinks: HDFS Sink, Hive Sink, ElasticSearch Sink.
4. Agent:
• A combination of source, channel, and sink running as a JVM process.
Multiple agents can be configured in a Flume setup.
Flume is used for Log Aggregation, Social Media Analysis
Sqoop
Apache Sqoop is a tool designed to efficiently transfer bulk data between structured
databases and the Hadoop ecosystem. It is primarily used for importing data from
relational databases into Hadoop for analysis and exporting processed data back to
relational databases.

How Sqoop Facilitates Data Ingestion:


1. Importing Data:
• Transfers data from relational databases to Hadoop components like HDFS,
Hive, or HBase.
• Data can be imported as text, CSV, or binary formats like Avro or Parquet.
2. Exporting Data:
• Moves data from Hadoop back to relational databases after processing or
transformation.
3. Integration:
• Works seamlessly with the Hadoop ecosystem, enabling direct loading into
Hive tables or HBase.
4. Parallelism:
• Uses MapReduce for parallel data transfer, making the process fast and
scalable.
Sqoop is used for Data Warehousing, ETL Pipelines.
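
A typical Sqoop import is launched from the command line; the hedged Python sketch below simply
shells out to the sqoop binary with commonly used options (the JDBC URL, credentials, table, and
HDFS directory are placeholders).

import subprocess

cmd = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://db-host:3306/sales",  # placeholder JDBC URL
    "--username", "etl_user",
    "--password", "secret",                          # in practice prefer --password-file
    "--table", "orders",                             # source table in the RDBMS
    "--target-dir", "/data/sales/orders",            # destination directory in HDFS
    "--num-mappers", "4",                            # parallel map tasks for the transfer
]
subprocess.run(cmd, check=True)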

Hadoop Archives
Hadoop Archives (HAR) is a feature of Hadoop that aggregates multiple small files into a
single archive to optimize storage and metadata management in HDFS. While not a direct
data ingestion tool like Flume or Sqoop, HAR can be used as part of the ingestion workflow
when managing large datasets with numerous small files.

Why Use HAR in Data Ingestion?


1. Small Files Problem:
• HDFS is optimized for large files, and storing many small files can burden the
NameNode with excessive metadata.
• HAR consolidates these small files, reducing the NameNode's load.
2. Efficient Storage:
• Aggregates files into a single archive, minimizing storage overhead.
3. Improved Performance:
• Faster access to archived files and reduced stress on HDFS for applications
reading large volumes of small files.
Serialization in Hadoop
Serialization in Hadoop refers to the process of converting data objects into a format that
can be stored and transmitted efficiently across a network or written to disk.
Deserialization is the reverse process—converting the serialized data back into objects.
Serialization is crucial in Hadoop as it handles large-scale data storage and transfer
between nodes.

Key Roles of Serialization in Hadoop:


1. Inter-Process Communication:
• Data is serialized when transmitted between nodes in the cluster.
2. Storage:
• Serialized data is stored in HDFS or intermediate data files during MapReduce
operations.
3. Efficiency:
• Optimized serialization formats reduce data size, improving storage and
processing performance.

AVRO
Apache Avro is a popular data serialization framework in the Hadoop ecosystem. It
provides a compact, fast, and efficient way to serialize structured data, with built-in
support for schema evolution and interoperability across programming languages.

Key Features of Avro:


1. Compact and Fast:
• Avro uses a binary data format, making it compact and efficient for storage
and transmission.
2. Schema-Based:
• Each Avro file includes a schema, allowing data to be self-descriptive.
• The schema is defined using JSON.
3. Schema Evolution:
• Avro supports backward and forward compatibility, enabling changes to the
schema without breaking existing data.
4. Interoperability:
• Avro supports multiple programming languages, including Java, Python, C++,
and more.
5. Integration with Hadoop:
• Works seamlessly with HDFS, Hive, and MapReduce.
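
A minimal sketch of an Avro round trip in Python, assuming the third-party fastavro package is
installed; the schema, records, and file name are made up for illustration.

from fastavro import parse_schema, reader, writer

schema = parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
})

records = [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]

# Serialize: the JSON-defined schema is embedded in the file, so the data is self-describing.
with open("users.avro", "wb") as out:
    writer(out, schema, records)

# Deserialize: the reader recovers the records using the embedded schema.
with open("users.avro", "rb") as inp:
    for record in reader(inp):
        print(record["name"], record["age"])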
UNIT III
1. Task Failures
A task failure occurs when a single task (map or reduce) fails to complete its operation
successfully. This is typically caused by issues such as out-of-memory errors, disk space
limitations, or problems with the task's input data.

2. Node Failures
A node failure occurs when a node (or a machine) in the Hadoop cluster becomes unavailable due
to hardware failure, network issues, or node shutdown.

3. JobTracker/ResourceManager Failures
The JobTracker (MapReduce v1) or ResourceManager (in YARN) is responsible for coordinating
job execution and scheduling tasks. If these components fail, the entire job might fail unless
handled properly.

4. Data Locality Failures


Data locality refers to the placement of tasks in relation to the data they need to process. In a
typical MapReduce job, mappers are assigned to nodes that hold their input data locally, to minimize
network overhead. When data-local scheduling is not possible, tasks must read their input over the
network, which slows jobs down and can contribute to timeouts.

5. Data Skew Failures (Skewed Processing)


Data skew occurs when some keys in the dataset are heavily concentrated in a few records, leading
to some reducers having much more data than others, potentially causing bottlenecks.

6. Task and Job Timeouts


Hadoop tasks or jobs can fail if they take too long to execute or do not meet the resource
requirements.

7. Data Integrity Failures


If data corruption occurs due to disk failure, network issues, or bad data input, jobs may fail because
they cannot read or process the data.

What is YARN (Yet Another Resource Negotiator)?


YARN (Yet Another Resource Negotiator) is a resource management layer in the Hadoop
ecosystem introduced in Hadoop 2.x. It is designed to overcome the limitations of the original
MapReduce framework, providing a more flexible and efficient way to manage and schedule
resources across a Hadoop cluster.
YARN acts as a cluster manager that handles resource allocation and job scheduling for different
applications running in the Hadoop ecosystem, allowing more than just MapReduce jobs to be
executed. It enables other big data frameworks, such as Apache Spark, Apache Tez, and others, to
run on top of Hadoop.
Key Components of YARN
YARN's architecture is based on three key components:
1. ResourceManager (RM):
• The ResourceManager is the master daemon responsible for managing resources
across the cluster.
• It has two main components:
• Scheduler: Allocates resources to various applications running on the cluster
based on user-defined policies (such as capacity, fairness, etc.).
• ApplicationsManager: Manages the lifecycle of applications, ensuring that
they are launched and terminated correctly.
• The ResourceManager communicates with the NodeManager to track available
resources and assign them to running applications.
2. NodeManager (NM):
• The NodeManager is a per-node agent responsible for managing resources on each
individual node.
• It monitors resource usage (e.g., memory, CPU) and ensures that containers (units of
resource allocation) are running as expected.
• The NodeManager also reports node health and resource availability back to the
ResourceManager.
3. ApplicationMaster (AM):
• The ApplicationMaster is specific to each application (job) running in the cluster.
• It is responsible for negotiating resources from the ResourceManager and managing
the execution of the application across the cluster.
• The ApplicationMaster tracks the status of the tasks (Map and Reduce tasks for
MapReduce jobs, for example) and handles task failures or retries.
UNIT IV

Introduction to Apache Pig


Apache Pig is a high-level platform for creating and executing data processing tasks in the Hadoop
ecosystem. It is designed to simplify the development of MapReduce programs and abstract the
complexity of writing low-level code in Java. Pig was developed by Yahoo! to handle large datasets
in a more efficient and flexible way.
Pig uses its own scripting language called Pig Latin, which is similar to SQL and provides a
simpler interface to interact with Hadoop for data analysis and manipulation tasks. It allows
developers to write data transformation pipelines, which can be executed in parallel across a
Hadoop cluster.

Key Features of Apache Pig


1. High-Level Abstraction:
• Pig provides a high-level abstraction for writing data transformation logic. Instead of
writing complex Java code for MapReduce, Pig allows you to write scripts in Pig
Latin, which is simpler and more intuitive.
2. Extensibility:
• Pig supports user-defined functions (UDFs), allowing users to write custom functions
in Java, Python, or other languages to process data. This makes it highly flexible and
extensible for different use cases.
3. Optimization:
• Pig optimizes queries through a series of internal optimizations such as the automatic
generation of MapReduce jobs, making it more efficient in handling large-scale data
processing tasks.
4. Schema Flexibility:
• Pig operates on both structured and semi-structured data. It can process data in
various formats, such as CSV and JSON files stored in HDFS, without needing a rigid
schema definition like traditional databases.
5. Data Flow Language:
• Pig Latin allows the specification of data flows for data processing tasks. It is a data
flow language that enables easy transformation of data in a sequence of operations.
6. Support for Nested Data:
• Unlike traditional databases that rely on flat data structures, Pig can handle nested
data, such as collections (bags, tuples, and maps), making it suitable for processing
complex data models.
7. Execution Modes:
• Pig can run in two modes:
• Local mode: Pig processes data on a single machine using local file systems.
It is ideal for testing and debugging small datasets.
• MapReduce mode: Pig processes data on a Hadoop cluster by generating
MapReduce jobs, which are distributed across multiple nodes for large-scale
processing.

Introduction to Apache Hive


Apache Hive is a data warehouse infrastructure built on top of Hadoop that provides a high-level
abstraction to query and manage large datasets stored in Hadoop's HDFS (Hadoop Distributed File
System). It allows users to write queries using a language similar to SQL, called HiveQL (or
HQL), making it easier to work with Hadoop for data analysis without needing to write complex
MapReduce code.
Hive was originally developed by Facebook to simplify the process of querying large datasets in
Hadoop. It is now a part of the Apache Software Foundation and is widely used for querying and
managing data in Hadoop ecosystems.

Key Features of Apache Hive


1. SQL-Like Query Language:
• HiveQL is a query language similar to SQL, which allows users to perform
operations like SELECT, JOIN, GROUP BY, and more on large-scale data stored in
Hadoop.
• Users with SQL knowledge can easily adapt to using Hive for querying big data.
2. Data Abstraction:
• Hive abstracts the complexities of working with raw Hadoop MapReduce jobs. Users
interact with Hive through queries, and Hive automatically generates the required
MapReduce jobs behind the scenes.
3. Support for Structured and Semi-Structured Data:
• Hive can process both structured (tables) and semi-structured (e.g., JSON, XML)
data formats, which makes it versatile in terms of the types of data it can handle.
4. Scalability:
• Hive is designed to handle very large datasets distributed across a cluster. It leverages
Hadoop’s scalability to process vast amounts of data efficiently.
5. Extensibility:
• Hive supports user-defined functions (UDFs), which allow users to extend its
functionality by writing custom functions for complex processing in languages like
Java, Python, and others.
6. Metastore:
• The Hive Metastore is a central repository that stores metadata about tables,
partitions, and databases. It helps to manage schema definitions and improve query
optimization.
7. Optimizations:
• Hive provides query optimizations, such as partitioning and bucketing, which enable
efficient data storage and retrieval.
8. Integration with Other Hadoop Ecosystem Tools:
• Hive can be integrated with other big data tools like Apache Pig, Apache HBase,
and Apache Spark, making it a central tool in the Hadoop ecosystem for analytics.
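
As a hedged illustration, HiveQL can be submitted from Python through the third-party PyHive
package, assuming a running HiveServer2; the host, database, and table names below are
placeholders.

from pyhive import hive

conn = hive.Connection(host="hive-server-host", port=10000, database="default")
cursor = conn.cursor()

# A HiveQL aggregation; Hive turns it into distributed jobs behind the scenes.
cursor.execute(
    "SELECT category, COUNT(*) AS cnt "
    "FROM sales "
    "GROUP BY category"
)
for category, cnt in cursor.fetchall():
    print(category, cnt)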

Introduction to Apache HBase


Apache HBase is an open-source, distributed, and scalable NoSQL database built on top of the
Hadoop ecosystem. It is designed to store and manage large amounts of structured and semi-
structured data in a distributed fashion, providing real-time random read and write access to large
datasets. HBase is modeled after Google Bigtable and is well-suited for handling big data
applications, especially where scalability and low-latency access to data are critical.
Unlike traditional relational databases, HBase is column-oriented rather than row-oriented, making
it more efficient for certain types of queries and workloads, such as read-heavy or write-heavy
applications.

Key Features of Apache HBase


1. Distributed and Scalable:
• HBase is built on top of Hadoop and leverages the scalability of the HDFS (Hadoop
Distributed File System) to store data across a cluster of machines. It can handle
petabytes of data and scale horizontally by adding more nodes to the cluster.
2. Real-time Read/Write Access:
• Unlike HDFS, which is optimized for batch processing, HBase provides real-time
access to data, allowing applications to perform fast random reads and writes.
3. Column-Oriented Storage:
• HBase stores data in columns instead of rows, which makes it more suitable for
storing sparse data (i.e., data with many empty fields) and for analytical applications
that focus on columnar access patterns.
4. Schema-less Design:
• HBase tables are schema-less, meaning columns do not need to be predefined. New
columns can be added dynamically, which is useful for applications with evolving
data models.
5. High Availability:
• HBase supports automatic failover and replication across multiple nodes, ensuring
high availability and data durability even in the case of hardware failures.
6. Support for Large Datasets:
• HBase is designed to store extremely large datasets, making it suitable for big data
use cases, such as time-series data, log data, and data for IoT applications.
7. Integrated with Hadoop Ecosystem:
• HBase integrates seamlessly with other Hadoop ecosystem tools like Hive, Pig,
MapReduce, and Spark for processing and querying large datasets stored in HBase.
8. Support for ACID Transactions:
• HBase provides support for ACID (Atomicity, Consistency, Isolation, Durability)
properties on a single row, making it suitable for applications that require
transactional consistency for individual data entries.
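
A hedged sketch of HBase's real-time random reads and writes from Python, using the third-party
HappyBase library over HBase's Thrift gateway; the host, table name, and column family are
placeholders, and the table is assumed to already exist.

import happybase

connection = happybase.Connection("hbase-thrift-host")
table = connection.table("users")  # assumes a table 'users' with column family 'info'

# Write: each cell is addressed by row key and column family:qualifier.
table.put(b"user#1001", {b"info:name": b"Alice", b"info:city": b"Pune"})

# Read a single row back by its key (low-latency random access).
row = table.row(b"user#1001")
print(row[b"info:name"].decode())

connection.close()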

Introduction to Big SQL


Big SQL is an advanced SQL query engine developed by IBM for running SQL queries on large-
scale, distributed data systems, particularly in Big Data environments. It is part of the IBM Big
Data suite and is designed to support massive datasets, enabling businesses to leverage the power of
SQL while handling Big Data workloads. Big SQL integrates with Apache Hadoop, IBM DB2, and
other big data technologies, allowing users to run SQL queries directly on data stored across various
platforms, including traditional databases and distributed data systems.
Big SQL simplifies querying of large datasets by providing a powerful SQL interface that supports
both relational and non-relational data, offering compatibility with standard SQL syntax, and
integrating seamlessly with Hadoop-based storage systems like HDFS and Hive.

Key Features of Big SQL


1. SQL on Big Data:
• Big SQL allows users to run traditional SQL queries on data stored in Hadoop and
other distributed storage systems, bridging the gap between traditional relational
databases and big data technologies.
2. Integrated with Hadoop Ecosystem:
• Big SQL integrates with Apache Hive, HDFS, HBase, and other components of the
Hadoop ecosystem. This integration allows users to run SQL queries on Hadoop data
without needing to learn MapReduce or other low-level big data programming
paradigms.
3. High Performance:
• Big SQL is designed for high performance and is optimized for executing complex
queries at scale. It can push down computation to Hadoop frameworks like
MapReduce and Apache Spark, improving query execution times.
4. Supports Mixed Workloads:
• Big SQL is capable of running mixed workloads, including OLAP (Online
Analytical Processing) and OLTP (Online Transaction Processing) queries,
enabling users to perform both real-time and batch analytics on the same platform.
5. Federated Queries:
• It supports federated queries, allowing users to query data across multiple systems
(relational databases, NoSQL databases, and Hadoop clusters) within the same SQL
query. This enables the integration of disparate data sources into a unified analytics
framework.
6. Data Virtualization:
• Big SQL enables data virtualization, meaning data can be queried without physically
moving or duplicating it from one storage system to another. This allows for
querying data across multiple data sources, both structured and unstructured, without
the need to ETL (Extract, Transform, Load) it into a central location.
7. Support for Advanced Analytics:
• Big SQL can integrate with IBM Watson Analytics, IBM SPSS, and other advanced
analytics tools to run complex analytical models and machine learning algorithms
directly on the data in Hadoop.
8. Security and Governance:
• Big SQL provides built-in security features, such as integration with Kerberos
authentication and support for data governance policies, ensuring that data access is
controlled and managed in a secure and compliant manner.

Types of Machine Learning

1. Supervised Learning:
• Data: Labeled data (input-output pairs).
• Goal: Predict outcomes for new data based on past examples.
• Examples: Spam detection, house price prediction.
2. Unsupervised Learning:
• Data: Unlabeled data.
• Goal: Find patterns or groupings in the data.
• Examples: Customer segmentation, anomaly detection.
3. Semi-Supervised Learning:
• Data: A mix of labeled and unlabeled data.
• Goal: Improve learning when labeled data is scarce.
• Examples: Image classification with few labeled images.
4. Reinforcement Learning:
• Data: Agent interacts with an environment and receives feedback
(rewards/penalties).
• Goal: Learn actions that maximize cumulative reward over time.
• Examples: Self-driving cars, game playing AI.
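
To ground the supervised case, here is a tiny scikit-learn sketch, assuming scikit-learn is
installed; the training data is toy data invented for illustration.

from sklearn.linear_model import LogisticRegression

# Labeled examples: [hours studied, hours slept] -> pass (1) / fail (0).
X_train = [[2, 9], [1, 5], [5, 6], [6, 8], [3, 4], [8, 7]]
y_train = [0, 0, 1, 1, 0, 1]

model = LogisticRegression()
model.fit(X_train, y_train)

# Predict the outcome for a new, unseen student.
print(model.predict([[4, 7]]))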
