BDA (2) Merged
NoSQL
NoSQL stands for "Not Only SQL." It is a category of database systems designed to handle large
volumes of unstructured, semi-structured, or structured data that traditional relational databases
(RDBMS) may struggle with.
1. Schema Flexibility:
2. Scalability:
3. High Performance:
o Optimized for high-speed reads and writes, particularly for massive datasets.
o NoSQL databases are optimized for non-relational data models, such as
documents, key-value pairs, wide columns, or graphs.
1. Key-Value Stores:
2. Document Stores:
3. Column-Family Stores:
o Data is stored in a tabular format with rows and columns, optimized for wide
datasets.
4. Graph Databases:
o Data is represented as nodes, edges, and properties for complex relationships.
o Examples: Neo4j, Amazon Neptune.
1. Key-Value Stores
Key-Value stores are the simplest type of NoSQL databases. They store data as a collection of key-
value pairs.
Architecture:
Data is stored in a dictionary-like structure where a unique key is associated with a value.
Keys are used to retrieve values, which can be strings, JSON, binary data, or other complex
objects.
Features:
Fast Reads and Writes: Optimized for quick lookup by key.
Schema-less: Values can have arbitrary formats, offering flexibility.
Scalability: Scales horizontally to handle large amounts of data.
Use Cases:
Caching: Session storage, user preferences.
Real-time applications: Shopping carts, leaderboards in gaming.
Examples:
Redis: An in-memory key-value store with support for complex data structures.
Amazon DynamoDB: A fully managed key-value and document database.
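For a concrete feel of the key-value model, here is a minimal Python sketch using the redis-py client; the connection details, key names, and values are assumptions made for illustration only.

import redis

# Connect to a local Redis instance (assumes Redis runs on the default port).
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Store and retrieve a session object keyed by a unique session ID.
r.set("session:1001", '{"user": "alice", "cart_items": 3}')
print(r.get("session:1001"))

# Keys can carry an expiry, a common pattern for caching and session storage.
r.setex("otp:alice", 300, "482913")  # value expires after 300 seconds

Because every lookup goes through a single key, reads and writes stay fast even as the dataset grows.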
2. Graph Stores
Graph stores are designed for managing and querying relationships between entities, modeled as
graphs.
Architecture:
Data is represented as nodes (entities), edges (relationships), and properties (metadata).
Graph traversal algorithms can be used to find patterns or paths.
Features:
Efficient Relationship Queries: Ideal for applications requiring deep and frequent relationship
traversal.
Flexible Schema: Nodes and edges can have different attributes.
Query Languages: Uses graph-specific languages like Cypher (Neo4j) or Gremlin.
Use Cases:
Social Networks: Modeling user connections and interactions.
Fraud Detection: Detecting suspicious patterns in financial transactions.
Recommendation Engines: Finding similar users or items.
Examples:
Neo4j: A popular graph database with a powerful query language (Cypher).
Amazon Neptune: A graph database for building knowledge graphs and social graphs.
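The relationship-centric style can be illustrated with a small Python sketch using the official Neo4j driver and Cypher; the connection URI, credentials, node labels, and relationship names are hypothetical.

from neo4j import GraphDatabase

# Connection details are assumptions; adjust them for your own deployment.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Create two User nodes and a FRIENDS_WITH relationship between them.
    session.run(
        "MERGE (a:User {name: $a}) MERGE (b:User {name: $b}) "
        "MERGE (a)-[:FRIENDS_WITH]->(b)",
        a="Alice", b="Bob",
    )
    # Relationship traversal: friends-of-friends of Alice, two hops away.
    result = session.run(
        "MATCH (:User {name: $name})-[:FRIENDS_WITH*2]->(fof) "
        "RETURN DISTINCT fof.name AS name",
        name="Alice",
    )
    for record in result:
        print(record["name"])

driver.close()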
4. Document Stores
Document stores organize data as documents, typically in JSON, BSON, or XML formats, where each
document is self-describing and contains nested fields.
Architecture:
Data is stored as collections of documents.
Documents can have different structures, making it schema-less.
Features:
Rich Querying: Supports querying nested fields, indexing, and aggregation.
Schema Flexibility: Allows storing semi-structured data.
Horizontal Scalability: Distributes documents across multiple nodes.
Use Cases:
Content Management Systems: Blogs, e-commerce catalogs.
Real-Time Analytics: Analyzing user interactions on websites or apps.
Mobile Applications: Storing user profiles, app data.
Examples:
MongoDB: A widely used document database for modern applications.
CouchDB: A database that uses a JSON-based document model with HTTP API access.
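As a rough illustration of the document model, the following Python sketch uses pymongo; the database name, collection, and document fields are made up for the example.

from pymongo import MongoClient

# Connect to a local MongoDB instance (connection string is an assumption).
client = MongoClient("mongodb://localhost:27017")
db = client["shop"]

# Documents in the same collection can have different shapes (schema flexibility).
db.products.insert_one({"name": "Laptop", "price": 999, "specs": {"ram_gb": 16}})
db.products.insert_one({"name": "T-shirt", "price": 15, "sizes": ["S", "M", "L"]})

# Query a nested field and project only the product name.
for doc in db.products.find({"specs.ram_gb": {"$gte": 8}}, {"name": 1, "_id": 0}):
    print(doc)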
1. Scalability:
o Horizontal scaling allows systems to handle growing data volumes by adding servers
instead of upgrading hardware.
2. Flexible Schema:
3. Performance:
4. Distributed Architecture:
o Ensures data availability and fault tolerance across multiple nodes.
o Suited for storing and processing logs, images, videos, and other non-tabular data.
Hadoop + HBase: Combines Hadoop’s distributed file system with HBase’s column-family
storage for scalable Big Data analytics.
Elasticsearch: Optimized for search and indexing large volumes of textual data.
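As a hedged sketch of how such search-oriented storage is used, the snippet below indexes and searches a log document with the Elasticsearch Python client; it assumes a local single-node cluster and a recent 8.x client, and the index and field names are made up.

from elasticsearch import Elasticsearch

# Assumes a local single-node cluster reachable on the default port.
es = Elasticsearch("http://localhost:9200")

# Index a log document; the field mapping is inferred automatically.
es.index(index="app-logs", document={"level": "ERROR", "msg": "timeout calling payment service"})

# Full-text search over the message field.
resp = es.search(index="app-logs", query={"match": {"msg": "timeout"}})
for hit in resp["hits"]["hits"]:
    print(hit["_source"])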
1. Volume Problems
Challenges:
NoSQL Solution: Column-family stores (e.g., Cassandra, HBase) and key-value stores (e.g.,
DynamoDB) for distributed storage.
2. Velocity Problems
Definition: Managing data generated at high speed (e.g., IoT sensors, social media streams).
Challenges:
NoSQL Solution: In-memory databases (e.g., Redis, Memcached) or event-driven document
stores (e.g., MongoDB).
3. Variety Problems
Definition: Dealing with diverse data types (structured, semi-structured, unstructured).
Challenges:
o Integrating different data formats like JSON, XML, logs, and multimedia.
NoSQL Solution: Document stores (e.g., MongoDB, Couchbase) for flexibility.
4. Veracity Problems
Definition: Ensuring data quality and reliability despite inconsistencies or errors.
Challenges:
NoSQL Solution: Graph databases (e.g., Neo4j) for uncovering patterns and identifying
inconsistencies.
5. Value Problems
Definition: Extracting meaningful insights and actionable intelligence from data.
Challenges:
NoSQL Solution: Hybrid systems like Elasticsearch (for text analytics) combined with data
lakes (e.g., Hadoop).
Key Features:
2. Decentralization:
3. Parallel Processing:
1. Scalability:
2. Fault Tolerance:
3. Cost Efficiency:
o Built on commodity hardware, reducing infrastructure costs.
4. High Throughput:
Applications:
Used in NoSQL databases, MapReduce frameworks, and data warehouses for analyzing
massive datasets.
1. Master-Slave Architecture
In a master-slave model, one node (the master) controls the distribution of tasks and data, while
other nodes (slaves) execute tasks and store data as instructed.
Features:
Data Replication: The master manages data synchronization among slaves.
Advantages:
Disadvantages:
Single point of failure: If the master fails, the system may halt unless backups exist.
Use Cases:
HDFS (Hadoop Distributed File System): Where the NameNode (master) manages DataNodes
(slaves).
2. Peer-to-Peer Architecture
In a peer-to-peer (P2P) model, all nodes are equal and perform both client and server roles. There is
no central master node.
Features:
Advantages:
No single point of failure: If one node fails, others continue to function.
Disadvantages:
Use Cases:
NoSQL Databases: Apache Cassandra and DynamoDB use P2P for distributed operations.
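To illustrate the peer-to-peer style in practice, here is a minimal sketch with the DataStax Python driver for Cassandra; any node listed as a contact point can serve the request, and the keyspace, table, and column names are assumptions.

from cassandra.cluster import Cluster

# Every node can accept reads and writes; the driver discovers the rest of the ring.
cluster = Cluster(["127.0.0.1"])  # contact point is an assumption
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.events (
        device_id text, ts timestamp, reading double,
        PRIMARY KEY (device_id, ts)
    )
""")
session.execute(
    "INSERT INTO demo.events (device_id, ts, reading) VALUES (%s, toTimestamp(now()), %s)",
    ("sensor-1", 22.5),
)
cluster.shutdown()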
Why:
Why:
Why:
Why:
o Graph databases can identify relationships and inconsistencies in complex datasets.
Why:
MODULE -1
CHARACTERISTICS:
Volume: The sheer size of data generated from multiple sources (e.g., social media, IoT
devices, transactions).
Velocity: The speed at which data is generated and processed (e.g., real-time streams).
Variety: The diverse formats of data, such as structured (databases), semi-structured (JSON,
XML), and unstructured (images, videos).
Big Data has transformed industries by enabling data-driven decisions, enhancing efficiency, and
fostering innovation.
1. Enhanced Decision-Making: Organizations analyze vast datasets for informed decisions.
2. Improved Efficiency: Streamlines operations and reduces costs.
10. Social Good: Addresses global challenges like disasters and climate change.
1. Healthcare
Application: Predictive analytics for personalized medicine, real-time patient monitoring,
and outbreak tracking.
Impact: Improved patient outcomes, reduced costs, and enhanced disease management.
2. Finance
Application: Fraud detection, algorithmic trading, and credit risk assessment.
3. Retail
Application: Customer behavior analysis, dynamic pricing, and inventory management.
Impact: Improved customer satisfaction, optimized supply chains, and increased sales.
4. Manufacturing
Application: Predictive maintenance, quality control, and production optimization.
Impact: Reduced downtime, cost savings, and increased operational efficiency.
5. Transportation and Logistics
Application: Route optimization, fleet management, and real-time traffic analysis.
Impact: Lower operational costs, reduced fuel consumption, and improved delivery
efficiency.
6. Education
Application: Adaptive learning systems, student performance tracking, and curriculum
development.
Impact: Personalized education, better resource allocation, and improved learning
outcomes.
7. Media and Entertainment
Application: Content recommendation, audience segmentation, and trend analysis.
Impact: Enhanced user engagement, targeted advertising, and data-driven content creation.
8. Agriculture
Application: Precision farming, crop monitoring, and weather forecasting.
Impact: Increased yields, efficient resource use, and reduced environmental impact.
9. Energy and Utilities
Application: Smart grid management, energy consumption prediction, and renewable energy
optimization.
10. Government and Public Sector
Application: Policy planning, fraud detection in public services, and urban planning.
11. Social Media and Marketing
Application: Sentiment analysis, influencer identification, and targeted campaigns.
12. Environment and Climate
Application: Monitoring climate patterns, tracking deforestation, and disaster prediction.
2. Semi-Structured Data:
o Data that does not follow a strict schema but contains tags or markers to separate
data elements. Often found in JSON, XML, and NoSQL databases.
3. Unstructured Data:
o Data that has no predefined structure, making it difficult to analyze using traditional
methods. This includes text, audio, video, and images.
4. Streaming Data:
o Continuous flow of data that is generated in real time. It is often processed and
analyzed as it is created to enable immediate action.
o Example: Sensor data, stock market feeds, and social media updates.
Traditional vs. Big Data Business Approach
4. Hadoop Common:
o This contains the necessary libraries and utilities needed for Hadoop’s core
components to run.
Benefits of Hadoop
1. Scalability:
o Hadoop can scale horizontally by adding more nodes to the cluster. It is
capable of handling petabytes of data.
2. Cost-Effective:
o Hadoop uses commodity hardware, which reduces the cost of storing and
processing data. The distributed nature of Hadoop ensures that even large
datasets can be processed efficiently without requiring expensive
infrastructure.
3. Fault Tolerance:
o Hadoop automatically replicates data blocks across multiple nodes, ensuring
that if a node fails, data is still accessible. This ensures high availability and
system reliability.
4. Flexibility:
o Hadoop can process various types of data—structured, semi-structured, and
unstructured. This makes it suitable for handling diverse Big Data sources like
logs, social media data, sensor data, and more.
5. Real-Time Data Processing:
o Although Hadoop is primarily known for batch processing, it can also
integrate with tools like Apache Spark for real-time data processing, enabling
faster insights.
Hadoop Ecosystem
The Hadoop ecosystem consists of a set of tools and frameworks that extend the capabilities
of Hadoop for data processing, management, and analytics. These components work
together with Hadoop to provide a more comprehensive solution for Big Data processing.
Some key components of the Hadoop ecosystem are:
1. Apache HBase
Purpose: A distributed, scalable, and real-time NoSQL database built on top of HDFS.
How It Works: HBase stores data in tables, which are split into regions, and it allows
fast random access to large datasets. It is used for applications that require low-
latency access to data, such as real-time analytics.
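A minimal Python sketch of low-latency, row-key access using the happybase client; this assumes the HBase Thrift server is running on the default port and that a user_profiles table with an info column family already exists (both are assumptions).

import happybase

# Connect via the HBase Thrift gateway (assumed to run on localhost:9090).
connection = happybase.Connection("localhost")
table = connection.table("user_profiles")

# Writes are keyed by row key; columns live inside column families.
table.put(b"user#1001", {b"info:name": b"Alice", b"info:city": b"Pune"})

# Random read by row key is the access pattern HBase is optimized for.
print(table.row(b"user#1001"))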
2. Apache Hive
Purpose: A data warehouse infrastructure built on top of Hadoop that allows users to
query and analyze large datasets using SQL-like queries.
How It Works: Hive provides a high-level abstraction for MapReduce, allowing users
to write SQL-like queries (HiveQL), which are then converted into MapReduce jobs
for execution. It is commonly used for batch processing and querying structured
data.
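As a rough sketch, HiveQL can be issued from Python via the PyHive library; the host, port, username, and table below are assumptions, and the query is compiled into MapReduce (or Tez) jobs behind the scenes.

from pyhive import hive

# Assumes HiveServer2 is reachable on the default port (10000).
conn = hive.connect(host="localhost", port=10000, username="hadoop")
cursor = conn.cursor()

# HiveQL looks like SQL but runs as distributed batch jobs over HDFS data.
cursor.execute("""
    SELECT category, COUNT(*) AS orders
    FROM sales
    GROUP BY category
""")
for row in cursor.fetchall():
    print(row)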
3. Apache Pig
Purpose: A high-level platform for creating MapReduce programs used with Hadoop.
Pig uses a scripting language called Pig Latin.
How It Works: Pig simplifies the development of MapReduce programs by allowing
users to write data transformations using a simple, procedural language. Pig is used
for batch processing and data transforma on.
4. Apache Spark
Purpose: A fast, in-memory data processing engine that can process data much faster
than MapReduce.
How It Works: Spark is designed to perform data processing tasks in memory, which
results in faster computa ons compared to the disk-based MapReduce model. It
supports real-time stream processing, machine learning, and interactive SQL queries.
Key Features: It provides APIs for Java, Scala, Python, and R and can be used for a
variety of data processing tasks, including batch processing, real-time streaming, and
iterative algorithms.
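A brief PySpark sketch of in-memory processing over data stored in HDFS; the application name, input path, and column names are hypothetical.

from pyspark.sql import SparkSession

# Build a Spark session; the HDFS path below is an assumption.
spark = SparkSession.builder.appName("clickstream-demo").getOrCreate()

# Spark keeps intermediate data in memory across stages instead of writing to disk.
events = spark.read.json("hdfs:///data/clickstream/*.json")
top_pages = (
    events.filter(events.event_type == "page_view")
          .groupBy("page")
          .count()
          .orderBy("count", ascending=False)
)
top_pages.show(10)
spark.stop()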
5. Apache Flume
Purpose: A distributed service for collecting, aggregating, and transporting large
amounts of log data into Hadoop.
How It Works: Flume is used to ingest streaming data into HDFS from various
sources, such as logs, social media, and network devices. It uses an event-driven
architecture to handle large data streams.
6. Apache Sqoop
Purpose: A tool designed to transfer bulk data between Hadoop and relational
databases.
How It Works: Sqoop facilitates importing data from relational databases into HDFS
and exporting data back to databases. It supports a wide range of databases,
including MySQL, Oracle, and SQL Server.
7. Apache Oozie
Purpose: A workflow scheduler system to manage Hadoop jobs.
How It Works: Oozie coordinates the execu on of jobs across the Hadoop
ecosystem, including HDFS, MapReduce, Hive, and Pig. It allows users to define
complex workflows that involve multiple tasks and dependencies.
8. Apache Zookeeper
Purpose: A centralized service for maintaining configuration information, naming,
and providing synchronization across distributed applications.
How It Works: Zookeeper helps manage distributed systems and ensures that
services are highly available by coordinating between the various components of the
Hadoop ecosystem.
9. Apache Mahout
Purpose: A machine learning library built on top of Hadoop.
How It Works: Mahout provides scalable machine learning algorithms for clustering,
classification, and collaborative filtering. It is often used in Big Data environments to
implement machine learning models.
10. Apache Cassandra
Purpose: A distributed NoSQL database designed for handling large amounts of data
across many commodity servers.
How It Works: Cassandra is often used for real-time analytics and as a backend
database for applica ons that require high availability and scalability. It provides
support for both structured and unstructured data.
Hadoop Ecosystem Workflow
1. Data Ingestion: Tools like Apache Flume, Sqoop, or Kafka can be used to ingest data
into the Hadoop ecosystem from external sources such as databases, logs, and
streaming data.
2. Data Storage: Once ingested, data is stored in HDFS, which allows it to be distributed
and replicated across the Hadoop cluster.
3. Data Processing: Processing tasks are carried out using MapReduce, Apache Pig, or
Apache Spark, depending on the specific use case (batch processing, real-time
streaming, etc.).
4. Data Analysis: Users can run queries on the processed data using Hive (SQL-like
queries), Apache Spark SQL, or Apache HBase (for real-time analytics).
5. Data Output: After processing, the results can be stored back into HDFS or a NoSQL
database like Cassandra, or exported to external systems (see the end-to-end sketch below).
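To tie the workflow steps together, here is a hedged end-to-end sketch in PySpark: it assumes raw logs have already been ingested into HDFS (step 1) and writes curated results back to HDFS for downstream analysis; the paths and column names are made up.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ecosystem-workflow").getOrCreate()

# 2. Storage: raw data already ingested (e.g. by Flume or Sqoop) sits in HDFS.
raw = spark.read.csv("hdfs:///raw/server_logs.csv", header=True, inferSchema=True)

# 3. Processing: filter the logs and aggregate error counts per host.
errors_per_host = (
    raw.filter(F.col("status") >= 500)
       .groupBy("host")
       .count()
)

# 4./5. Analysis and output: results are written back to HDFS for downstream
# tools (Hive tables, HBase, or export via Sqoop).
errors_per_host.write.mode("overwrite").parquet("hdfs:///curated/errors_per_host")
spark.stop()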
Just like a DBMS, a DSMS provides a wide range of operations such as storage, processing, analysis,
and integration, and it also supports visualization and reporting, but only for data streams.
A wide range of DSMS applications are available in the market, including Apache Flink, Apache
Kafka, Apache Storm, and Amazon Kinesis. A DSMS processes two types of queries: standard queries
and ad hoc queries.
A DSMS consists of various layers, each dedicated to a particular operation, as follows:
1. Data Source Layer
The first layer of a DSMS is the data source layer. As its name suggests, it comprises all the data
sources, including sensors, social media feeds, financial markets, stock markets, etc. Capturing and
parsing of the data stream happens in this layer. It is essentially the collection layer that gathers the data.
2. Data Ingestion Layer
You can consider this layer a bridge between the data source layer and the processing layer. Its main
purpose is to handle the flow of data, i.e., data flow control, data buffering, and data routing.
3. Processing Layer
This layer is considered the heart of the DSMS architecture; it is the functional layer of DSMS
applications. It processes the data streams in real time using processing engines such as Apache
Flink or Apache Storm. Its main function is to filter, transform, aggregate, and enrich the data
streams in order to derive insights and detect patterns (a minimal consumer sketch follows below).
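The following Python sketch uses the kafka-python client to show the kind of per-record filtering and enrichment a processing layer performs; the broker address, topic name, and field names are assumptions.

import json
from kafka import KafkaConsumer

# Assumes a local Kafka broker and a 'sensor-readings' topic (both hypothetical).
consumer = KafkaConsumer(
    "sensor-readings",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

# Filter, transform, and flag anomalies as events arrive, one record at a time.
for message in consumer:
    reading = message.value
    if reading["temperature"] > 75:          # filter/enrich step
        alert = {"device": reading["device_id"], "temp": reading["temperature"]}
        print("ALERT:", alert)               # in practice, forward to the storage layer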
4. Storage Layer
Once the data is processed, it needs to be stored in a storage unit. The storage layer consists of
various stores such as NoSQL databases, distributed databases, etc. It helps to ensure data durability
and the availability of data in case of system failure.
5. Querying Layer
As mentioned above, a DSMS supports two types of queries: ad hoc queries and standard queries.
This layer provides the tools that can be used for querying and analyzing the stored data streams. It
also offers SQL-like query languages or programming APIs. These queries can answer questions such
as "How many entries were made?" or "Which type of data was inserted?"
6. Visualization Layer
This layer provides tools for visualization such as charts, pie charts, histograms, etc. On the basis of
these visual representations, it also helps to generate reports for analysis.
7. Integration Layer
This layer is responsible for integrating DSMS applications with traditional systems, business
intelligence tools, data warehouses, ML applications, and NLP applications. It helps to improve
applications that are already running.
These layers are responsible for the working of DSMS applications. Together they provide scalable
and fault-tolerant applications that can handle huge volumes of streaming data. The layers can change
according to business requirements: some deployments may include all layers, while others may exclude some.
Conclusion
A DSMS architecture therefore enables real-time processing of continuous data streams, with
components for data ingestion, stream processing, and querying that allow low-latency analysis and
provide instant insights. Such an architecture supports rapid monitoring and effective decision-making,
and is well suited to applications that receive constantly updated information from a variety of
source types.
HDFS Architecture
HDFS is designed to be highly scalable, reliable, and efficient, enabling the storage and processing of
massive datasets. Its architecture consists of several key components:
1. NameNode
2. DataNode
3. Secondary NameNode
4. HDFS Client
5. Block Structure
NameNode
The NameNode is the master server that manages the filesystem namespace and controls access to
files by clients. It performs operations such as opening, closing, and renaming files and directories.
Additionally, the NameNode maps file blocks to DataNodes, maintaining the metadata and the
overall structure of the file system. This metadata is stored in memory for fast access and persisted
on disk for reliability.
Key Responsibilities:
DataNode
DataNodes are the worker nodes in HDFS, responsible for storing and retrieving actual data blocks as
instructed by the NameNode. Each DataNode manages the storage attached to it and periodically
reports the list of blocks it stores to the NameNode.
Key Responsibilities:
• Performing block creation, deletion, and replication upon instruction from the NameNode.
• Periodically sending block reports and heartbeats to the NameNode to confirm its status.
Secondary NameNode
The Secondary NameNode acts as a helper to the primary NameNode, primarily responsible for
merging the EditLogs with the current filesystem image (FsImage) to reduce the potential load on the
NameNode. It creates checkpoints of the namespace to ensure that the filesystem metadata is up-to-
date and can be recovered in case of a NameNode failure.
Key Responsibilities:
HDFS Client
The HDFS client is the interface through which users and applications interact with HDFS. It
allows for file creation, deletion, reading, and writing operations. The client communicates with the
NameNode to determine which DataNodes hold the blocks of a file and interacts directly with the
DataNodes for actual data read/write operations.
Key Responsibilities:
• Communicating with the NameNode for metadata and with DataNodes for data access.
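As an illustration of the client role, the snippet below uses the hdfs Python package (a WebHDFS client) to write, read, and list files; it assumes WebHDFS is enabled, the NameNode web port is 9870, and the host name and paths are hypothetical.

from hdfs import InsecureClient

# The client asks the NameNode for metadata and block locations, then streams
# data directly to and from the DataNodes.
client = InsecureClient("http://namenode-host:9870", user="hadoop")

client.write("/user/hadoop/notes.txt", data=b"hello hdfs", overwrite=True)

with client.read("/user/hadoop/notes.txt") as reader:
    print(reader.read())

print(client.list("/user/hadoop"))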
Block Structure
HDFS stores files by dividing them into large blocks, typically 128MB or 256MB in size. Each block is
stored independently across multiple DataNodes, allowing for parallel processing and fault tolerance.
The NameNode keeps track of the block locations and their replicas.
Key Features:
• Large block size reduces the overhead of managing a large number of blocks.
• Blocks are replicated across multiple DataNodes to ensure data availability and fault
tolerance.
HDFS Advantages
HDFS offers several advantages that make it a preferred choice for managing large datasets in
distributed computing environments:
Scalability
HDFS is highly scalable, allowing for the storage and processing of petabytes of data across
thousands of machines. It is designed to handle an increasing number of nodes and storage without
significant performance degradation.
Key Aspects:
• Linear scalability allows the addition of new nodes without reconfiguring the entire system.
Fault Tolerance
HDFS ensures high availability and fault tolerance through data replication. Each block of data is
replicated across multiple DataNodes, ensuring that data remains accessible even if some nodes fail.
Key Features:
• Configurable replication factor allows administrators to balance storage efficiency and fault
tolerance.
High Throughput
HDFS is optimized for high-throughput access to large datasets, making it suitable for data-intensive
applications. It allows for parallel processing of data across multiple nodes, significantly speeding up
data read and write operations.
Key Features:
• Optimized for sequential data access, reducing seek times and increasing throughput.
Cost-Effective
HDFS is designed to run on commodity hardware, significantly reducing the cost of setting up and
maintaining a large-scale storage infrastructure. Its open-source nature further reduces the total cost
of ownership.
Key Features:
Data Locality
HDFS takes advantage of data locality by moving computation closer to where the data is stored. This
minimizes data transfer over the network, reducing latency and improving overall system
performance.
Key Features:
• Data-aware scheduling ensures that tasks are assigned to nodes where the data resides.
Robustness
HDFS is built to handle hardware failures gracefully. The NameNode and DataNodes are designed to
recover from failures without losing data, and the system continually monitors the health of nodes to
prevent data loss.
Key Features:
HDFS Use Cases
HDFS is widely used in various industries and applications that require large-scale data processing:
• Big Data Analytics: HDFS is a core component of Hadoop-based big data platforms, enabling
the storage and analysis of massive datasets for insights and decision-making.
• Data Warehousing: Enterprises use HDFS to store and manage large volumes of historical
data for reporting and business intelligence.
• Machine Learning: HDFS provides a robust storage layer for machine learning frameworks,
facilitating the training of models on large datasets.
• Log Processing: HDFS is used to store and process log data from web servers, applications,
and devices, enabling real-time monitoring and analysis.
• Content Management: Media companies use HDFS to store and distribute large multimedia
files, ensuring high availability and efficient access.