
M-3

NoSQL
NoSQL stands for "Not Only SQL." It is a category of database systems designed to handle large volumes of unstructured, semi-structured, or structured data that traditional relational databases (RDBMS) may struggle with.

• Key Characteristics of NoSQL Databases:

1. Schema Flexibility:

o NoSQL databases are schema-less or have dynamic schemas, allowing changes to the data structure without requiring a redesign.

2. Scalability:

o Built for horizontal scalability, enabling data to be distributed across many servers.

3. High Performance:

o Optimized for high-speed reads and writes, particularly for massive datasets.

4. Designed for Distributed Systems:

o NoSQL systems are often built for cloud or distributed environments, offering fault tolerance and availability.

5. Support for Specific Data Models:

o NoSQL databases are optimized for non-relational data models, such as documents, key-value pairs, wide-columns, or graphs.

• Types of NoSQL Databases:

1. Key-Value Stores:

o Data is stored as key-value pairs.

o Examples: Redis, DynamoDB.

2. Document Stores:

o Data is stored as documents, typically in JSON, BSON, or XML format.

o Examples: MongoDB, CouchDB.

3. Column-Family Stores:

o Data is stored in a tabular format with rows and columns, optimized for wide datasets.

o Examples: Apache Cassandra, HBase.

4. Graph Databases:

o Data is represented as nodes, edges, and properties for complex relationships.

o Examples: Neo4j, Amazon Neptune.

NoSQL Business Drivers

Organizations adopt NoSQL databases to meet modern business requirements and address challenges that relational databases cannot effectively handle.
1. Scalability and Big Data:
• Businesses generate and collect massive amounts of data (Big Data).
• NoSQL databases allow for horizontal scaling across commodity hardware, making it possible to manage and process vast datasets.
2. Flexibility in Data Models:
• Modern applications often deal with varied and rapidly changing data formats.
• NoSQL's schema-less or flexible schema structure supports agile development and unstructured/semi-structured data.
3. High-Performance Requirements:
• Applications such as real-time analytics, recommendation engines, and IoT demand high-speed reads and writes.
• NoSQL databases are optimized for these use cases, delivering low-latency responses.
4. Cloud and Distributed Systems:
• Businesses increasingly deploy applications in distributed, cloud-based environments.
• NoSQL databases natively support distributed architectures, offering fault tolerance and scalability.
5. Cost Efficiency:
• NoSQL databases can scale out on inexpensive commodity hardware, reducing infrastructure costs compared to traditional RDBMS.
6. Real-Time Analytics:
• Businesses need actionable insights from data in real time.
• NoSQL supports in-memory data processing and analytics for immediate results.
7. Handling Diverse Use Cases:
• NoSQL supports a variety of applications, such as:
o Social media: Managing user profiles, posts, and relationships.
o E-commerce: Storing catalogs and customer activity data.
o IoT: Handling high-velocity sensor data.
o Recommendation systems: Capturing and analyzing user behavior patterns.
8. Global Reach:
• NoSQL databases provide geo-distribution capabilities, enabling applications to serve a global audience with minimal latency.
9. Emerging Technologies:
• NoSQL databases back emerging technologies such as machine learning pipelines, IoT platforms, and microservices, which demand flexible, high-volume data handling.

NoSQL Data Architecture Patterns

NoSQL databases use different data architecture patterns to address specific application needs, focusing on scalability, performance, and flexibility. The main types of NoSQL data architecture patterns are Key-Value Stores, Graph Stores, Column-Family (Bigtable) Stores, and Document Stores.

1. Key-Value Stores
Key-Value stores are the simplest type of NoSQL databases. They store data as a collection of key-value pairs.
Architecture:
• Data is stored in a dictionary-like structure where a unique key is associated with a value.
• Keys are used to retrieve values, which can be strings, JSON, binary data, or other complex objects.
Features:
• Fast Reads and Writes: Optimized for quick lookup by key.
• Schema-less: Values can have arbitrary formats, offering flexibility.
• Scalability: Scales horizontally to handle large amounts of data.
Use Cases:
• Caching: Session storage, user preferences.
• Real-time applications: Shopping carts, leaderboards in gaming.
Examples:
• Redis: An in-memory key-value store with support for complex data structures.
• Amazon DynamoDB: A fully managed key-value and document database.
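A minimal sketch of the key-value pattern, using the Python redis client against a local Redis instance; the host, port, key names, and values below are illustrative assumptions:

```python
import redis

# Connect to a local Redis server (assumed to be running on the default port).
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Write: associate a unique key with a value (here a serialized shopping cart).
r.set("cart:user:1001", '{"items": ["sku-42", "sku-77"], "total": 59.90}')

# Read: the key is the only handle needed to retrieve the value.
cart = r.get("cart:user:1001")
print(cart)

# Keys can also carry expirations, which suits caching and session storage.
r.setex("session:abc123", 3600, "user:1001")  # expires after one hour
```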

2. Graph Stores
Graph stores are designed for managing and querying relationships between entities, modeled as graphs.
Architecture:
• Data is represented as nodes (entities), edges (relationships), and properties (metadata).
• Graph traversal algorithms can be used to find patterns or paths.
Features:
• Efficient Relationship Queries: Ideal for applications requiring deep and frequent relationship traversal.
• Flexible Schema: Nodes and edges can have different attributes.
• Query Languages: Uses graph-specific languages like Cypher (Neo4j) or Gremlin.
Use Cases:
• Social Networks: Modeling user connections and interactions.
• Fraud Detection: Detecting suspicious patterns in financial transactions.
• Recommendation Engines: Finding similar users or items.
Examples:
• Neo4j: A popular graph database with a powerful query language (Cypher).
• Amazon Neptune: A graph database for building knowledge graphs and social graphs.
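A minimal sketch of the graph pattern, using the neo4j Python driver to create two nodes with a relationship and then traverse it; the URI, credentials, labels, and property names are illustrative assumptions:

```python
from neo4j import GraphDatabase

# Connection details are assumptions for a local Neo4j instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Create two user nodes and a FOLLOWS relationship between them.
    session.run(
        "MERGE (a:User {name: $a}) "
        "MERGE (b:User {name: $b}) "
        "MERGE (a)-[:FOLLOWS]->(b)",
        a="alice", b="bob",
    )

    # Traverse the relationship: whom does alice follow?
    result = session.run(
        "MATCH (:User {name: $a})-[:FOLLOWS]->(f:User) RETURN f.name AS followed",
        a="alice",
    )
    for record in result:
        print(record["followed"])

driver.close()
```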

3. Column-Family (Bigtable) Stores

Column-family stores, inspired by Google's Bigtable, organize data in a tabular format with rows and columns, where columns are grouped into families.
Architecture:
• Data is stored in a sparse, distributed table where rows have unique keys.
• Each row contains multiple column families, and each family can have a variable number of columns.
• Rows are distributed across nodes based on their keys.
Features:
• High Write Throughput: Optimized for write-heavy workloads.
• Wide Rows: Supports a large number of columns per row, making it efficient for time-series or analytical data.
• Scalability: Scales horizontally across commodity servers.
Use Cases:
• Time-Series Data: Sensor data, logs.
• Event Data: User clickstreams.
• Recommendation Systems: Storing and analyzing user-item interactions.
Examples:
• Apache Cassandra: A distributed column-family store for high availability and scalability.
• HBase: A Bigtable implementation for Hadoop.
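A minimal sketch of the column-family pattern, using the DataStax cassandra-driver for Python to store time-series sensor readings in a wide-row layout; the keyspace, table, and contact point are illustrative assumptions:

```python
from cassandra.cluster import Cluster

# Connect to a local Cassandra node (assumed contact point).
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# One partition per sensor; readings cluster by timestamp within that row.
session.execute(
    "CREATE KEYSPACE IF NOT EXISTS demo "
    "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}"
)
session.execute(
    "CREATE TABLE IF NOT EXISTS demo.readings ("
    "  sensor_id text, ts timestamp, value double, "
    "  PRIMARY KEY (sensor_id, ts))"
)

# Writes are keyed by partition (sensor_id) and clustered by time (ts).
session.execute(
    "INSERT INTO demo.readings (sensor_id, ts, value) VALUES (%s, toTimestamp(now()), %s)",
    ("sensor-1", 21.7),
)

# Reads fetch a slice of one wide row.
for row in session.execute(
    "SELECT ts, value FROM demo.readings WHERE sensor_id = %s", ("sensor-1",)
):
    print(row.ts, row.value)

cluster.shutdown()
```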

4. Document Stores
Document stores organize data as documents, typically in JSON, BSON, or XML formats, where each document is self-describing and contains nested fields.
Architecture:
• Data is stored as collections of documents.
• Documents can have different structures, making the model schema-less.
Features:
• Rich Querying: Supports querying nested fields, indexing, and aggregation.
• Schema Flexibility: Allows storing semi-structured data.
• Horizontal Scalability: Distributes documents across multiple nodes.
Use Cases:
• Content Management Systems: Blogs, e-commerce catalogs.
• Real-Time Analytics: Analyzing user interactions on websites or apps.
• Mobile Applications: Storing user profiles, app data.
Examples:
• MongoDB: A widely used document database for modern applications.
• CouchDB: A database that uses a JSON-based document model with HTTP API access.
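A minimal sketch of the document pattern, using pymongo against a local MongoDB instance; the database, collection, and field names are illustrative assumptions:

```python
from pymongo import MongoClient

# Connect to a local MongoDB server (assumed default port).
client = MongoClient("mongodb://localhost:27017")
db = client["shop"]

# Documents in one collection may have different structures (schema flexibility).
db.products.insert_many([
    {"name": "Laptop", "price": 999, "specs": {"ram_gb": 16, "cpu": "i7"}},
    {"name": "T-shirt", "price": 15, "sizes": ["S", "M", "L"]},
])

# Rich querying: filter on a nested field and project only selected fields.
for doc in db.products.find({"specs.ram_gb": {"$gte": 8}}, {"_id": 0, "name": 1, "price": 1}):
    print(doc)
```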

NoSQL Solutions for Big Data

Big Data refers to large volumes of data generated at high velocity and in a variety of formats. NoSQL databases are specifically designed to manage Big Data challenges due to their scalability, flexibility, and performance.

Why NoSQL for Big Data?

1. Scalability:

o Horizontal scaling allows systems to handle growing data volumes by adding servers instead of upgrading hardware.

2. Flexible Schema:

o Adapt to rapidly changing data structures without redesigning the schema.

3. Performance:

o Optimized for high-throughput operations, enabling real-time processing.

4. Distributed Architecture:

o Ensures data availability and fault tolerance across multiple nodes.

5. Handling Unstructured Data:

o Suited for storing and processing logs, images, videos, and other non-tabular data.

Examples of NoSQL Solutions for Big Data:

• Hadoop + HBase: Combines Hadoop's distributed file system with HBase's column-family storage for scalable Big Data analytics.

• Apache Cassandra: Handles high-volume, time-series data in distributed environments.

• MongoDB: Manages semi-structured and unstructured data for applications requiring real-time analytics.

• Elasticsearch: Optimized for search and indexing large volumes of textual data.

Understanding the Types of Big Data Problems

Big Data problems can be broadly categorized based on their nature and the challenges they present.

1. Volume Problems

• Definition: Handling massive amounts of data (terabytes, petabytes, or more).

• Challenges:

o Storage: Efficiently storing huge datasets.

o Scalability: Scaling the system to accommodate growing data.

• NoSQL Solution: Column-family stores (e.g., Cassandra, HBase) and key-value stores (e.g., DynamoDB) for distributed storage.

2. Velocity Problems

• Definition: Managing data generated at high speed (e.g., IoT sensors, social media streams).

• Challenges:

o Real-time processing: Analyzing data as it arrives.

o Low-latency responses: Supporting applications like fraud detection or recommendation engines.

• NoSQL Solution: In-memory databases (e.g., Redis, Memcached) or event-driven document stores (e.g., MongoDB).

3. Variety Problems

• Definition: Dealing with diverse data types (structured, semi-structured, unstructured).

• Challenges:

o Integrating different data formats like JSON, XML, logs, and multimedia.

o Querying heterogeneous datasets.

• NoSQL Solution: Document stores (e.g., MongoDB, Couchbase) for flexibility.

4. Veracity Problems

• Definition: Ensuring data quality and reliability despite inconsistencies or errors.

• Challenges:

o Cleaning data: Removing duplicates, filling gaps, and handling inconsistencies.

o Data governance: Managing sensitive or regulated information.

• NoSQL Solution: Graph databases (e.g., Neo4j) for uncovering patterns and identifying inconsistencies.

5. Value Problems

• Definition: Extracting meaningful insights and actionable intelligence from data.

• Challenges:

o Analytics: Running complex queries on massive datasets.

o Machine Learning: Feeding data pipelines for model training.

• NoSQL Solution: Hybrid systems like Elasticsearch (for text analytics) combined with data lakes (e.g., Hadoop).

What is Shared-Nothing Architecture?

A shared-nothing architecture is a distributed computing model where each node operates independently, without sharing memory or disk storage. Communication between nodes occurs over a network, making it highly scalable and fault-tolerant.

Key Features:

1. Distributed Data Storage:

o Data is partitioned across multiple nodes (horizontal scaling).

2. Decentralization:

o Nodes are self-sufficient, reducing bottlenecks and single points of failure.

3. Parallel Processing:

o Workloads are distributed, enabling high performance for data-intensive tasks.

Benefits for Big Data Analysis:

1. Scalability:

o Can handle increasing data volumes by adding nodes to the system.

2. Fault Tolerance:

o Node failures affect only a small part of the system.

3. Cost Efficiency:

o Built on commodity hardware, reducing infrastructure costs.

4. High Throughput:

o Parallel processing across independent nodes increases query speed.

Applications:

• Used in NoSQL databases, MapReduce frameworks, and data warehouses for analyzing massive datasets.
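A minimal sketch of how a shared-nothing system decides which independent node owns a given record: hash the record key and map it to one node. The node names and the simple modulo scheme are illustrative assumptions (production systems typically use consistent hashing to limit data movement when nodes join or leave):

```python
import hashlib

# Each node owns its own storage; a record lives on exactly one node.
NODES = ["node-1", "node-2", "node-3", "node-4"]

def owner(key: str) -> str:
    """Map a record key to the node responsible for storing it."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

for k in ["user:1001", "user:1002", "order:77", "sensor-9"]:
    print(k, "->", owner(k))
```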

Choosing Distribution Models: Master-Slave vs. Peer-to-Peer

1. Master-Slave Architecture

In a master-slave model, one node (the master) controls the distribution of tasks and data, while other nodes (slaves) execute tasks and store data as instructed.

Features:

• Centralized Control: The master node handles coordination.

• Data Replication: The master manages data synchronization among slaves.

• Consistency: Ensures strong consistency if implemented correctly.

Advantages:

• Simple to implement and maintain.

• Easy to enforce global consistency.

• Ideal for systems with a high read-to-write ratio.

Disadvantages:

• Scalability limits: The master can become a bottleneck.

• Single point of failure: If the master fails, the system may halt unless backups exist.

Use Cases:

• Relational Databases: Systems like MySQL in a replicated setup.

• HDFS (Hadoop Distributed File System): Where the NameNode (master) manages DataNodes (slaves).

2. Peer-to-Peer Architecture

In a peer-to-peer (P2P) model, all nodes are equal and perform both client and server roles. There is no central master node.

Features:

• Decentralization: Nodes communicate directly and share responsibility.

• Partitioning and Replication: Data is partitioned and replicated across peers.

Advantages:

• No single point of failure: If one node fails, others continue to function.

• High scalability: Nodes can be added dynamically without bottlenecks.

• Fault tolerance: Redundancy ensures resilience.

Disadvantages:

• Complexity: Coordination between nodes is more challenging.

• Eventual consistency: Strong consistency is harder to achieve in distributed systems.

Use Cases:

• NoSQL Databases: Apache Cassandra and DynamoDB use P2P for distributed operations.

• File Sharing Systems: BitTorrent and blockchain networks.

NoSQL Systems to Handle Big Data Problems

NoSQL databases are specifically designed to address the challenges of Big Data, leveraging distributed architectures like shared-nothing and models like master-slave or peer-to-peer.

1. Handling Volume (Large Data Sets)

• System: Apache Cassandra (peer-to-peer)

• Why:

o Highly scalable for storing petabytes of data.

o Efficient partitioning and replication.

2. Handling Velocity (Real-Time Data)

• System: Redis (master-slave)

• Why:

o In-memory data storage enables low-latency responses.

o Suitable for caching and streaming applications.

3. Handling Variety (Diverse Data Formats)

• System: MongoDB (master-slave with replica sets)

• Why:

o Schema-less design supports structured, semi-structured, and unstructured data.

o Flexible querying for complex data models.

4. Handling Veracity (Data Accuracy and Quality)

• System: Neo4j (graph-based, master-slave)

• Why:

o Graph databases can identify relationships and inconsistencies in complex datasets.

o Useful for fraud detection and social network analysis.

5. Handling Value (Extracting Insights)

• System: Elasticsearch (peer-to-peer)

• Why:

o Full-text search and powerful analytics.

o Distributed architecture supports large-scale indexing and querying.

MODULE-1

Introduction to Big Data

Big Data refers to extremely large datasets that are complex, diverse, and grow rapidly, making them difficult to process and analyze using traditional data management tools. These datasets exhibit the 5 V's of Big Data: Volume, Velocity, Variety, Veracity, and Value.

CHARACTERISTICS:

• Volume: The sheer size of data generated from multiple sources (e.g., social media, IoT devices, transactions).

• Velocity: The speed at which data is generated and processed (e.g., real-time streams).

• Variety: The diverse formats of data, such as structured (databases), semi-structured (JSON, XML), and unstructured (images, videos).

• Veracity: The reliability and accuracy of data, ensuring meaningful insights.

• Value: Extracting actionable insights to drive business decisions.

Big Data in Real-World Applications

Big Data has transformed industries by enabling data-driven decisions, enhancing efficiency, and fostering innovation.

1. Enhanced Decision-Making: Organizations analyze vast datasets for informed decisions.

2. Improved Efficiency: Streamlines operations and reduces costs.

3. Personalized Experiences: Tailors services to individual preferences.

4. Real-Time Insights: Monitors and reacts to live data streams.

5. Innovation: Uncovers opportunities for new products and services.

6. Predictive Maintenance: Prevents equipment failures in industries.

7. Healthcare: Enhances diagnosis and personalized treatments.

8. Smart Cities: Optimizes urban planning and public services.

9. Risk Management: Identifies and mitigates risks effectively.

10. Social Good: Addresses global challenges like disasters and climate change.

Applications of Big Data in Various Domains

Big Data finds applications across diverse fields, enabling organizations to optimize processes, improve decision-making, and innovate. Below are some key domains where Big Data plays a critical role:

1. Healthcare

• Application: Predictive analytics for personalized medicine, real-time patient monitoring, and outbreak tracking.

• Impact: Improved patient outcomes, reduced costs, and enhanced disease management.

2. Finance

• Application: Fraud detection, algorithmic trading, and credit risk assessment.

• Impact: Enhanced security, better investment strategies, and streamlined lending processes.

3. Retail and E-Commerce

• Application: Customer behavior analysis, dynamic pricing, and inventory management.

• Impact: Improved customer satisfaction, optimized supply chains, and increased sales.

4. Manufacturing

• Application: Predictive maintenance, quality control, and production optimization.

• Impact: Reduced downtime, cost savings, and increased operational efficiency.

5. Transportation and Logistics

• Application: Route optimization, fleet management, and real-time traffic analysis.

• Impact: Lower operational costs, reduced fuel consumption, and improved delivery efficiency.

6. Education

• Application: Adaptive learning systems, student performance tracking, and curriculum development.

• Impact: Personalized education, better resource allocation, and improved learning outcomes.

7. Media and Entertainment

• Application: Content recommendation, audience segmentation, and trend analysis.

• Impact: Enhanced user engagement, targeted advertising, and data-driven content creation.

8. Agriculture

• Application: Precision farming, crop monitoring, and weather forecasting.

• Impact: Increased yields, efficient resource use, and reduced environmental impact.

9. Energy and Utilities

• Application: Smart grid management, energy consumption prediction, and renewable energy optimization.

• Impact: Improved energy efficiency, reduced costs, and enhanced sustainability.

10. Public Sector and Governance

• Application: Policy planning, fraud detection in public services, and urban planning.

• Impact: Improved governance, transparency, and efficient public service delivery.

11. Social Media and Marketing

• Application: Sentiment analysis, influencer identification, and targeted campaigns.

• Impact: Better customer engagement, increased brand loyalty, and higher ROI.

12. Environment and Climate

• Application: Monitoring climate patterns, tracking deforestation, and disaster prediction.

• Impact: Enhanced environmental conservation and disaster preparedness.

Types of Big Data

1. Structured Data:

o Organized data in predefined formats, usually stored in relational databases or spreadsheets. Examples include transaction records, customer profiles, and inventory data.

o Example: SQL databases like MySQL or PostgreSQL.

2. Semi-Structured Data:

o Data that does not follow a strict schema but contains tags or markers to separate data elements. Often found in JSON, XML, and NoSQL databases.

o Example: Log files, sensor data, and social media posts.

3. Unstructured Data:

o Data that has no predefined structure, making it difficult to analyze using traditional methods. This includes text, audio, video, and images.

o Example: Emails, customer feedback, and multimedia content.

4. Streaming Data:

o Continuous flow of data that is generated in real time. It is often processed and analyzed as it is created to enable immediate action.

o Example: Sensor data, stock market feeds, and social media updates.

Traditional vs. Big Data Business Approach

Traditional Business Approach:

• Data Handling: Typically involves relational databases (RDBMS) that store structured data in tables and rows. Data is processed in batch modes, often requiring human intervention for analysis and decision-making.

• Data Volume: Businesses mainly deal with smaller datasets that fit into traditional database systems. Data is collected periodically, and businesses generally handle it on a monthly, weekly, or daily basis.

• Analysis: Analytical methods are deterministic and often rely on historical data to identify trends. Decision-making is based on periodic reports and aggregated data.

• Technology: Focuses on traditional business intelligence (BI) tools and analytics platforms (e.g., SQL, OLAP, and spreadsheets) for querying and reporting.

• Focus: The emphasis is on historical data, and insights are often slow to emerge because of the time it takes to process the data.

Big Data Business Approach:

• Data Handling: Big Data systems handle vast amounts of structured, semi-structured, and unstructured data, often leveraging distributed storage systems like Hadoop and NoSQL databases. Real-time processing is prioritized, and data comes from diverse sources such as IoT devices, social media, and sensors.

• Data Volume: Big Data businesses handle enormous datasets, often at the petabyte scale, that require specialized tools and infrastructures for storage, processing, and analysis.

• Analysis: Big Data analytics includes advanced techniques like machine learning, predictive analytics, and real-time data analysis. Decision-making is often data-driven, with automated insights generated in real time or near real time.

• Technology: Big Data uses technologies like Apache Hadoop, Spark, and machine learning frameworks. These tools are designed to scale across distributed environments and process large volumes of data efficiently.

• Focus: The focus is on real-time data, predictive analytics, and dynamic decision-making. Businesses can react to events as they happen and forecast future trends with higher accuracy.
Concept of Hadoop
Hadoop is an open-source framework for processing and storing large datasets in a distributed computing environment. It is designed to handle the demands of Big Data, providing scalability, fault tolerance, and cost-effective processing.
Key Components of Hadoop
1. Hadoop Distributed File System (HDFS):
o HDFS is the storage layer of Hadoop. It breaks large files into smaller chunks (blocks) and distributes them across multiple nodes in a cluster. This ensures high availability and fault tolerance, as data is replicated across different machines.
2. MapReduce:
o MapReduce is a programming model used for processing large datasets in parallel across a distributed cluster. It involves two steps:
- Map: Processes input data and produces key-value pairs.
- Reduce: Aggregates and processes the key-value pairs produced by the Map step.
o This allows for efficient parallel computation on massive datasets.

3. YARN (Yet Another Resource Negotiator):

o YARN is the resource management layer in Hadoop. It manages resources across the cluster and schedules tasks, allowing multiple applications to run concurrently.

4. Hadoop Common:
o This contains the necessary libraries and utilities needed for Hadoop's core components to run.

Benefits of Hadoop
1. Scalability:
o Hadoop can scale horizontally by adding more nodes to the cluster. It is capable of handling petabytes of data.
2. Cost-Effective:
o Hadoop uses commodity hardware, which reduces the cost of storing and processing data. The distributed nature of Hadoop ensures that even large datasets can be processed efficiently without requiring expensive infrastructure.

3. Fault Tolerance:
o Hadoop automatically replicates data blocks across multiple nodes, ensuring that if a node fails, data is still accessible. This ensures high availability and system reliability.

4. Flexibility:
o Hadoop can process various types of data—structured, semi-structured, and unstructured. This makes it suitable for handling diverse Big Data sources like logs, social media data, sensor data, and more.
5. Real-Time Data Processing:
o Although Hadoop is primarily known for batch processing, it can also integrate with tools like Apache Spark for real-time data processing, enabling faster insights.

Use Cases of Hadoop

• E-commerce: Retailers use Hadoop for customer behavior analysis, inventory management, and recommendation engines.
• Healthcare: Hadoop helps process and analyze patient data, enabling personalized treatments and predictive health analytics.
• Finance: Hadoop is used for fraud detection, risk management, and real-time stock market analysis.
• Social Media: Platforms like Facebook and Twitter use Hadoop to store and analyze user interactions, content engagement, and sentiment analysis.

Core Hadoop Components

The core Hadoop components provide the foundation for distributed data processing and storage in Big Data environments. The key components of Hadoop are:
1. Hadoop Distributed File System (HDFS)
• Purpose: HDFS is the storage layer of Hadoop. It is designed to store large datasets across multiple machines in a distributed and fault-tolerant manner.
• How It Works:
o Data Blocks: Data is divided into blocks (typically 128 MB or 256 MB) and distributed across nodes in a cluster.
o Replication: Each data block is replicated multiple times (typically three copies) across different nodes to ensure fault tolerance and high availability.
o Master/Slave Architecture: HDFS follows a master-slave architecture, where the NameNode is the master that manages metadata and file locations, and DataNodes are the slaves that store the actual data.
2. MapReduce
• Purpose: MapReduce is a programming model used to process large datasets in parallel across a Hadoop cluster. It consists of two main stages: the Map stage and the Reduce stage.
• How It Works:
o Map: The input data is divided into chunks, and each chunk is processed by a separate task in parallel. The map function transforms input data into key-value pairs.
o Reduce: After mapping, the key-value pairs are grouped by key and processed by the reduce function to produce the final result.
o Fault Tolerance: MapReduce ensures fault tolerance by rerunning failed tasks on different nodes.
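A minimal sketch of the Map and Reduce stages as Hadoop Streaming scripts in Python: the mapper emits (word, 1) pairs and the reducer sums the counts per key after the framework sorts the pairs. The word-count task and file names are illustrative assumptions.

```python
# mapper.py - reads raw text from stdin, emits "word<TAB>1" key-value pairs.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
# reducer.py - receives pairs sorted by key, sums the counts for each word.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

Locally, the same pipeline can be simulated by piping a text file through mapper.py, a sort, and then reducer.py.
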
3. YARN (Yet Another Resource Negotiator)
• Purpose: YARN is the resource management layer of Hadoop. It manages and schedules resources across the cluster for various applications running on Hadoop.
• How It Works:
o ResourceManager (RM): The master daemon that manages resources in the cluster.
o NodeManager (NM): Runs on each node in the cluster and reports the resources available to the ResourceManager.
o ApplicationMaster (AM): Each application has its own ApplicationMaster, which is responsible for negotiating resources from the ResourceManager and managing the application's execution.
4. Hadoop Common
• Purpose: Hadoop Common includes the libraries, utilities, and APIs that are shared across the Hadoop ecosystem. These components provide the necessary functionality for HDFS, MapReduce, and other Hadoop services.
• How It Works:
o It contains common libraries required for Hadoop's core modules and distributed processing capabilities.

Hadoop Ecosystem
The Hadoop ecosystem consists of a set of tools and frameworks that extend the capabilities of Hadoop for data processing, management, and analytics. These components work together with Hadoop to provide a more comprehensive solution for Big Data processing. Some key components of the Hadoop ecosystem are:
1. Apache HBase
• Purpose: A distributed, scalable, and real-time NoSQL database built on top of HDFS.
• How It Works: HBase stores data in tables, which are split into regions, and it allows fast random access to large datasets. It is used for applications that require low-latency access to data, such as real-time analytics.
2. Apache Hive
• Purpose: A data warehouse infrastructure built on top of Hadoop that allows users to query and analyze large datasets using SQL-like queries.
• How It Works: Hive provides a high-level abstraction for MapReduce, allowing users to write SQL-like queries (HiveQL), which are then converted into MapReduce jobs for execution. It is commonly used for batch processing and querying structured data.
3. Apache Pig
• Purpose: A high-level platform for creating MapReduce programs used with Hadoop. Pig uses a scripting language called Pig Latin.
• How It Works: Pig simplifies the development of MapReduce programs by allowing users to write data transformations using a simple, procedural language. Pig is used for batch processing and data transformation.
4. Apache Spark
• Purpose: A fast, in-memory data processing engine that can process data much faster than MapReduce.
• How It Works: Spark is designed to perform data processing tasks in memory, which results in faster computations compared to the disk-based MapReduce model. It supports real-time stream processing, machine learning, and interactive SQL queries.
• Key Features: It provides APIs for Java, Scala, Python, and R and can be used for a variety of data processing tasks, including batch processing, real-time streaming, and iterative algorithms.
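A minimal PySpark sketch of the same word-count task shown earlier, this time expressed against Spark's in-memory API; the input path is an illustrative assumption:

```python
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session.
spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()

# Read lines, split into words, count occurrences - intermediate results stay
# in memory where possible, which is why Spark is typically faster than
# disk-based MapReduce.
lines = spark.read.text("hdfs:///data/input.txt")  # assumed input location
counts = (
    lines.rdd.flatMap(lambda row: row.value.split())
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
)

for word, count in counts.take(10):
    print(word, count)

spark.stop()
```
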
5. Apache Flume
• Purpose: A distributed service for collecting, aggregating, and transporting large amounts of log data into Hadoop.
• How It Works: Flume is used to ingest streaming data into HDFS from various sources, such as logs, social media, and network devices. It uses an event-driven architecture to handle large data streams.
6. Apache Sqoop
• Purpose: A tool designed to transfer bulk data between Hadoop and relational databases.
• How It Works: Sqoop facilitates importing data from relational databases into HDFS and exporting data back to databases. It supports a wide range of databases, including MySQL, Oracle, and SQL Server.
7. Apache Oozie
• Purpose: A workflow scheduler system to manage Hadoop jobs.
• How It Works: Oozie coordinates the execution of jobs across the Hadoop ecosystem, including HDFS, MapReduce, Hive, and Pig. It allows users to define complex workflows that involve multiple tasks and dependencies.
8. Apache Zookeeper
• Purpose: A centralized service for maintaining configuration information, naming, and providing synchronization across distributed applications.
• How It Works: Zookeeper helps manage distributed systems and ensures that services are highly available by coordinating between the various components of the Hadoop ecosystem.
9. Apache Mahout
• Purpose: A machine learning library built on top of Hadoop.
• How It Works: Mahout provides scalable machine learning algorithms for clustering, classification, and collaborative filtering. It is often used in Big Data environments to implement machine learning models.
10. Apache Cassandra
• Purpose: A distributed NoSQL database designed for handling large amounts of data across many commodity servers.
• How It Works: Cassandra is often used for real-time analytics and as a backend database for applications that require high availability and scalability. It provides support for both structured and unstructured data.
Hadoop Ecosystem Workflow
1. Data Ingestion: Tools like Apache Flume, Sqoop, or Kafka can be used to ingest data into the Hadoop ecosystem from external sources such as databases, logs, and streaming data.
2. Data Storage: Once ingested, data is stored in HDFS, which allows it to be distributed and replicated across the Hadoop cluster.
3. Data Processing: Processing tasks are carried out using MapReduce, Apache Pig, or Apache Spark, depending on the specific use case (batch processing, real-time streaming, etc.).
4. Data Analysis: Users can run queries on the processed data using Hive (SQL-like queries), Apache Spark SQL, or Apache HBase (for real-time analytics).
5. Data Output: After processing, the results can be stored back into HDFS or a NoSQL database like Cassandra, or exported to external systems.

Distributed File Systems (DFS)

A Distributed File System (DFS) is a system that manages the storage and retrieval of files across multiple machines in a distributed environment, while presenting users with a single, unified view of the files. It enables high availability, scalability, and fault tolerance by distributing data across several nodes in a network.
In a DFS, the physical organization of compute nodes and the file system structure are designed to handle the challenges posed by distributed data, such as data locality, redundancy, and the need for efficient access to data spread across multiple locations.

1. Physical Organization of Compute Nodes

In a distributed file system, compute nodes refer to the physical servers or machines that work together to manage and process data. The organization of these nodes plays a critical role in the performance, reliability, and scalability of the system.
Key Elements of Node Organization:
1. Node Types:
o Data Nodes: These are the machines where actual data is stored. They handle the storage and retrieval of data blocks.
o Metadata Nodes: These nodes store metadata, which includes information about where data blocks are stored, file names, access permissions, and file structure. For example, in Hadoop's HDFS, the NameNode stores metadata, while the DataNodes store the data itself.
2. Replication:
o To ensure data durability and availability, data is often replicated across multiple compute nodes. In case one node fails, other nodes with the same data replica can take over, preventing data loss. The number of replicas is configurable (e.g., Hadoop HDFS typically replicates data three times).
3. Partitioning and Sharding:
o Files are split into chunks or blocks that are distributed across multiple compute nodes. This allows the system to handle large amounts of data more efficiently.
o Each data block typically has a fixed size (e.g., 128 MB in HDFS). The blocks are distributed across the cluster to achieve load balancing, and each block can be stored on different nodes based on the system's configuration.
4. Fault Tolerance:
o Distributed file systems are designed with fault tolerance in mind. Data replication ensures that even if some compute nodes or disks fail, the system can still retrieve the data from other replicas. The metadata is also often stored in a way that multiple copies exist, reducing the risk of losing important information.
5. Communication and Coordination:
o Nodes communicate with each other to ensure data consistency, especially in systems with high read and write throughput. In systems like Hadoop, YARN (Yet Another Resource Negotiator) is responsible for coordinating resource management across the cluster.
o Coordination is also important for maintaining the consistency of the metadata and ensuring that all nodes are synchronized regarding file locations and updates.

2. Large-Scale File-System Organization

Large-scale file systems are designed to efficiently store, manage, and process large datasets spread across a large number of compute nodes. The organization of such file systems must address issues such as scalability, data locality, fault tolerance, and high availability.
Key Aspects of Large-Scale File-System Organization:
1. Block-based Storage:
o Large-scale file systems, such as HDFS, split files into fixed-size blocks. These blocks are distributed across different nodes in the cluster.
o This approach allows for parallel processing, as multiple nodes can access and work on different blocks of the same file simultaneously.
2. Distributed Metadata Management:
o Metadata servers (e.g., the NameNode in HDFS) manage the location and properties of files. Metadata stores information such as file names, directory structure, and the location of data blocks across the cluster.
o In very large systems, metadata itself might need to be replicated or partitioned across multiple nodes to avoid bottlenecks and improve access speed.
3. Data Replication and Consistency:
o Data replication is crucial in large-scale file systems to ensure data availability and fault tolerance. Each file block can be replicated multiple times across different nodes to prevent data loss in case of node failures.
o In many distributed file systems, consistency models like eventual consistency or strong consistency are employed, depending on the use case. Eventual consistency allows for more flexibility but can result in temporary inconsistencies.
4. Data Locality:
o One of the key considerations in large-scale file systems is ensuring data locality. The system tries to keep data close to where it is being processed to minimize latency.
o For example, in Hadoop, when a MapReduce job is run, it tries to schedule the processing of data on the nodes where the data blocks are stored, thereby reducing network traffic and improving performance.
5. Scalability:
o The file system should be able to scale horizontally by adding more nodes to the cluster. Large-scale distributed file systems are designed to handle petabytes or even exabytes of data.
o They should also be able to scale in terms of the number of clients accessing the system. The architecture must handle requests from thousands of clients simultaneously without performance degradation.
6. Fault Tolerance and Recovery:
o Large-scale file systems employ techniques like data replication, checkpoints, and logging to ensure fault tolerance.
o In the event of a node or disk failure, data can be recovered from other replicas or backup nodes, ensuring that there is no loss of data.
o For instance, in HDFS, if a data block becomes unavailable due to a node failure, the system automatically re-replicates the block from another node where the replica is stored.
7. Access Control and Security:
o Large-scale distributed file systems typically include mechanisms for access control to protect sensitive data. These systems may integrate with Kerberos or other security frameworks to provide authentication and authorization for users.
o Encryption of data, both in transit and at rest, is a common practice in these systems to secure data against unauthorized access.
DSMS stands for Data Stream Management System. It is a software application, much like a DBMS (database management system), but it handles the processing and management of continuously flowing data streams rather than static data such as Excel, PDF, or other files. It is generally used to deal with data streams from various sources, including sensor data, social media feeds, financial reports, etc.

Just like a DBMS, a DSMS provides a wide range of operations such as storage, processing, analysis, and integration, and it also helps generate visualizations and reports, but only for data streams.

A wide range of DSMS applications is available in the market, among them Apache Flink, Apache Kafka, Apache Storm, Amazon Kinesis, etc. A DSMS processes two types of queries: standing (continuous) queries and ad hoc queries.

Data Stream Management System Architecture

A DSMS consists of various layers, each dedicated to a particular operation, as follows:

1. Data Source Layer

The first layer of a DSMS is the data source layer. As the name suggests, it comprises all the data sources, which include sensors, social media feeds, financial markets, stock markets, etc. Capturing and parsing of the data stream happens in this layer. Basically, it is the collection layer that collects the data.

2. Data Ingestion Layer

You can consider this layer as the bridge between the data source layer and the processing layer. The main purpose of this layer is to handle the flow of data, i.e., data flow control, data buffering, and data routing.

3. Processing Layer

This layer is considered the heart of the DSMS architecture; it is the functional layer of DSMS applications. It processes the data streams in real time, using processing engines like Apache Flink or Apache Storm. The main function of this layer is to filter, transform, aggregate, and enrich the data stream in order to derive insights and detect patterns (see the windowed-aggregation sketch after this list).

4. Storage Layer

Once data is processed, it needs to be stored in a storage unit. The storage layer consists of various stores such as NoSQL databases, distributed databases, etc. It helps ensure data durability and the availability of data in case of system failure.

5. Querying Layer

As mentioned above, a DSMS supports two types of queries: ad hoc queries and standing queries. This layer provides the tools that can be used for querying and analyzing the stored data stream. It also offers SQL-like query languages or programming APIs. These queries can be questions like: how many entries were made? what type of data was inserted?

6. Visualization and Reporting Layer

This layer provides tools for visualization such as charts, pie charts, histograms, etc. On the basis of this visual representation, it also helps generate reports for analysis.

7. Integration Layer

This layer is responsible for integrating the DSMS application with traditional systems, business intelligence tools, data warehouses, ML applications, and NLP applications. It helps improve applications that are already running.

Together, these layers make DSMS applications work, providing scalable and fault-tolerant systems that can handle huge volumes of streaming data. The layers can change according to business requirements: some deployments may include all layers, while others may exclude some.
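A minimal, framework-free sketch of what the processing layer does conceptually: consume an unbounded stream of events and maintain a sliding-window aggregate. The simulated event source, window length, and spike threshold are illustrative assumptions; real deployments would use an engine such as Apache Flink or Apache Storm.

```python
import random
import time
from collections import deque

def sensor_stream():
    """Simulated unbounded stream of (timestamp, reading) events."""
    while True:
        yield time.time(), random.uniform(15.0, 30.0)

WINDOW_SECONDS = 10   # assumed sliding-window length
window = deque()      # events currently inside the window

for i, (ts, value) in enumerate(sensor_stream()):
    if i >= 100:      # stop after 100 events so the sketch terminates
        break
    window.append((ts, value))
    # Evict events that have fallen out of the window (filtering step).
    while window and ts - window[0][0] > WINDOW_SECONDS:
        window.popleft()
    # Aggregation step: rolling average over the current window.
    avg = sum(v for _, v in window) / len(window)
    # Simple pattern detection: flag readings well above the rolling average.
    if value > avg * 1.2:
        print(f"spike detected: {value:.1f} vs rolling average {avg:.1f}")
```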

Conclusion

A DSMS architecture opens the way for real-time processing of continuous data streams, with components for data ingestion, stream processing, and querying that allow low-latency analysis and provide instant insights. Such an architecture supports swift monitoring and effective decision-making, and it is well suited to applications that receive constantly updated information from a variety of source types.
HDFS Architecture

HDFS is designed to be highly scalable, reliable, and efficient, enabling the storage and processing of
massive datasets. Its architecture consists of several key components:

1. NameNode

2. DataNode

3. Secondary NameNode

4. HDFS Client

5. Block Structure

NameNode

The NameNode is the master server that manages the filesystem namespace and controls access to
files by clients. It performs operations such as opening, closing, and renaming files and directories.
Additionally, the NameNode maps file blocks to DataNodes, maintaining the metadata and the
overall structure of the file system. This metadata is stored in memory for fast access and persisted
on disk for reliability.

Key Responsibilities:

• Maintaining the filesystem tree and metadata.

• Managing the mapping of file blocks to DataNodes.

• Ensuring data integrity and coordinating replication of data blocks.

DataNode

DataNodes are the worker nodes in HDFS, responsible for storing and retrieving actual data blocks as
instructed by the NameNode. Each DataNode manages the storage attached to it and periodically
reports the list of blocks it stores to the NameNode.

Key Responsibilities:

• Storing data blocks and serving read/write requests from clients.

• Performing block creation, deletion, and replication upon instruction from the NameNode.

• Periodically sending block reports and heartbeats to the NameNode to confirm its status.

Secondary NameNode

The Secondary NameNode acts as a helper to the primary NameNode, primarily responsible for
merging the EditLogs with the current filesystem image (FsImage) to reduce the potential load on the
NameNode. It creates checkpoints of the namespace to ensure that the filesystem metadata is up-to-
date and can be recovered in case of a NameNode failure.

Key Responsibilities:

• Merging EditLogs with FsImage to create a new checkpoint.

• Helping to manage the NameNode's namespace metadata.


HDFS Client

The HDFS client is the interface through which users and applications interact with the HDFS. It
allows for file creation, deletion, reading, and writing operations. The client communicates with the
NameNode to determine which DataNodes hold the blocks of a file and interacts directly with the
DataNodes for actual data read/write operations.

Key Responsibilities:

• Facilitating interaction between the user/application and HDFS.

• Communicating with the NameNode for metadata and with DataNodes for data access.
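A minimal sketch of the client role, using the third-party hdfs (HdfsCLI) Python package over WebHDFS; the NameNode URL, user, and paths are illustrative assumptions:

```python
from hdfs import InsecureClient

# The client contacts the NameNode for metadata (here via WebHDFS),
# then streams the actual bytes to and from DataNodes.
client = InsecureClient("http://namenode-host:9870", user="hadoop")  # assumed endpoint

# Write a file: the client asks the NameNode where to place blocks,
# then sends the data to the chosen DataNodes.
client.write("/data/events/sample.txt", data=b"hello hdfs\n", overwrite=True)

# List a directory and read the file back.
print(client.list("/data/events"))
with client.read("/data/events/sample.txt") as reader:
    print(reader.read())
```
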

Block Structure

HDFS stores files by dividing them into large blocks, typically 128MB or 256MB in size. Each block is
stored independently across multiple DataNodes, allowing for parallel processing and fault tolerance.
The NameNode keeps track of the block locations and their replicas.

Key Features:

• Large block size reduces the overhead of managing a large number of blocks.

• Blocks are replicated across multiple DataNodes to ensure data availability and fault
tolerance.
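A small worked example of the block layout described above, assuming a 128 MB block size and a replication factor of 3 (both configurable):

```python
import math

FILE_SIZE_MB = 1000      # assumed file size (about 1 GB)
BLOCK_SIZE_MB = 128      # common default HDFS block size
REPLICATION = 3          # default replication factor

blocks = math.ceil(FILE_SIZE_MB / BLOCK_SIZE_MB)   # 8 blocks (the last one smaller)
stored_copies = blocks * REPLICATION                # 24 block replicas cluster-wide
raw_storage_mb = FILE_SIZE_MB * REPLICATION         # ~3000 MB of raw capacity used

print(blocks, stored_copies, raw_storage_mb)
```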

HDFS Advantages

HDFS offers several advantages that make it a preferred choice for managing large datasets in
distributed computing environments:

Scalability

HDFS is highly scalable, allowing for the storage and processing of petabytes of data across
thousands of machines. It is designed to handle an increasing number of nodes and storage without
significant performance degradation.

Key Aspects:

• Linear scalability allows the addition of new nodes without reconfiguring the entire system.

• Supports horizontal scaling by adding more DataNodes.

Fault Tolerance

HDFS ensures high availability and fault tolerance through data replication. Each block of data is
replicated across multiple DataNodes, ensuring that data remains accessible even if some nodes fail.

Key Features:

• Automatic block replication ensures data redundancy.

• Configurable replication factor allows administrators to balance storage efficiency and fault
tolerance.

High Throughput
HDFS is optimized for high-throughput access to large datasets, making it suitable for data-intensive
applications. It allows for parallel processing of data across multiple nodes, significantly speeding up
data read and write operations.

Key Features:

• Supports large data transfers and batch processing.

• Optimized for sequential data access, reducing seek times and increasing throughput.

Cost-Effective

HDFS is designed to run on commodity hardware, significantly reducing the cost of setting up and
maintaining a large-scale storage infrastructure. Its open-source nature further reduces the total cost
of ownership.

Key Features:

• Utilizes inexpensive hardware, reducing capital expenditure.

• Open-source software eliminates licensing costs.

Data Locality

HDFS takes advantage of data locality by moving computation closer to where the data is stored. This
minimizes data transfer over the network, reducing latency and improving overall system
performance.

Key Features:

• Data-aware scheduling ensures that tasks are assigned to nodes where the data resides.

• Reduces network congestion and improves processing speed.

Reliability and Robustness

HDFS is built to handle hardware failures gracefully. The NameNode and DataNodes are designed to
recover from failures without losing data, and the system continually monitors the health of nodes to
prevent data loss.

Key Features:

• Automatic detection and recovery from node failures.

• Regular health checks and data integrity verification.

HDFS Use Cases

HDFS is widely used in various industries and applications that require large-scale data processing:

• Big Data Analytics: HDFS is a core component of Hadoop-based big data platforms, enabling
the storage and analysis of massive datasets for insights and decision-making.

• Data Warehousing: Enterprises use HDFS to store and manage large volumes of historical
data for reporting and business intelligence.

• Machine Learning: HDFS provides a robust storage layer for machine learning frameworks,
facilitating the training of models on large datasets.
• Log Processing: HDFS is used to store and process log data from web servers, applications,
and devices, enabling real-time monitoring and analysis.

• Content Management: Media companies use HDFS to store and distribute large multimedia
files, ensuring high availability and efficient access.
