BDA Unit III

Hadoop is an open-source framework for storing and processing large datasets in a distributed computing environment, utilizing HDFS and MapReduce for efficient data handling. It enables parallel processing and fault tolerance, making it suitable for big data applications. Additionally, NoSQL databases provide flexible, high-performance solutions for managing diverse data types, overcoming the limitations of traditional SQL databases.

UNIT-III

Definition: Hadoop is an open-source framework designed for storing and processing large volumes of data in a distributed computing environment. It enables scalable and efficient data handling by utilizing clusters of commodity hardware. The Hadoop framework allows parallel processing and fault tolerance, making it a powerful tool for managing big data.

Introduction: In today's digital world, the volume of data generated is enormous, requiring systems that can handle, store, and analyze large datasets efficiently. Traditional database management systems (DBMS) struggle with scalability and performance when dealing with massive amounts of data. Hadoop was developed as a solution to these challenges, providing a distributed computing model that processes large datasets across multiple nodes simultaneously. By leveraging the Hadoop Distributed File System (HDFS) and the MapReduce processing paradigm, Hadoop ensures high availability, fault tolerance, and efficient handling of structured and unstructured data.

Processing Data with Hadoop:


Definition: Processing data with Hadoop refers to the method of handling large-scale datasets using distributed computing across multiple nodes. Hadoop processes data using HDFS for storage and MapReduce for processing, ensuring parallel computation and fault tolerance.

Introduction: Hadoop provides an efficient way to process large amounts of data by dividing tasks into smaller sub-tasks and executing them in parallel across a distributed system. The Hadoop processing model includes data ingestion, storage, processing, and retrieval. The MapReduce programming model plays a crucial role in processing data efficiently by breaking it down into two primary phases: the Map phase and the Reduce phase.

Introduction to MapReduce Programming:


Definition: MapReduce is a programming paradigm used in Hadoop to process vast amounts of data in parallel across a distributed cluster. It simplifies large-scale computations by breaking them down into two main operations: mapping and reducing.

Introduction: The MapReduce framework enables efficient parallel processing of big data by distributing tasks across multiple computing nodes. It consists of two core functions:

 Map Function: Processes input data and generates intermediate key-value pairs.

 Reduce Function: Aggregates and processes the intermediate results to produce the final output.
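The two phases can be illustrated with a small word-count job simulated in plain Python (a sketch only — real Hadoop jobs are typically written in Java or run through Hadoop Streaming, with the shuffle/sort step handled by the framework):

```python
from collections import defaultdict

def map_phase(line):
    """Map function: emit a (word, 1) pair for every word in an input line."""
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(key, values):
    """Reduce function: aggregate all counts emitted for one key."""
    return (key, sum(values))

# Simulated job: word count over a tiny in-memory "dataset".
lines = ["big data needs big systems", "hadoop handles big data"]

# Map: each line is processed independently (in Hadoop, by parallel Mappers).
intermediate = [pair for line in lines for pair in map_phase(line)]

# Shuffle/sort: group intermediate values by key (done by the framework).
grouped = defaultdict(list)
for key, value in intermediate:
    grouped[key].append(value)

# Reduce: one call per distinct key produces the final output.
result = dict(reduce_phase(k, v) for k, v in grouped.items())
# result["big"] == 3, result["data"] == 2
```

Note that the Map calls are independent of one another, which is exactly what lets Hadoop run them on many nodes at once.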
Mapper:
Definition: The Mapper function is the first phase of the MapReduce framework, responsible for processing input data and transforming it into key-value pairs.

Introduction: Each Mapper processes a portion of the input dataset independently and generates key-value pairs as output. These intermediate key-value pairs are later sorted and passed to the Reducer. The Mapper function is highly parallelizable, allowing multiple Mappers to process data simultaneously for efficiency.
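In isolation, a Mapper simply turns raw input records into key-value pairs. The sketch below assumes a made-up "year,temperature" line format (the format and field names are illustrative, not part of these notes):

```python
def temperature_mapper(record):
    """Parse one hypothetical 'year,temperature' line into a (year, temp) pair."""
    year, temp = record.split(",")
    return (year, int(temp))

# Each Mapper would receive its own split of the input file.
records = ["1950,22", "1950,31", "1951,28"]
pairs = [temperature_mapper(r) for r in records]
# pairs == [("1950", 22), ("1950", 31), ("1951", 28)]
```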

Reducer:
Definition: The Reducer function is the second phase of MapReduce, which processes intermediate key-value pairs generated by the Mappers and consolidates them into final results.

Introduction: After sorting and grouping the intermediate results, the Reducer applies aggregation, computation, or transformation operations to generate the final output. This phase is responsible for reducing large amounts of data into meaningful insights.
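Continuing the hypothetical (year, temperature) pairs from above, a Reducer receives one key together with all of its grouped values and collapses them into a single result:

```python
def max_temperature_reducer(year, temps):
    """Reduce: collapse all temperatures observed for one year to the maximum."""
    return (year, max(temps))

# By this point the framework has already sorted and grouped Mapper output by key.
grouped = {"1950": [22, 31], "1951": [28]}
final = dict(max_temperature_reducer(y, t) for y, t in grouped.items())
# final == {"1950": 31, "1951": 28}
```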

Combiner:
Definition: The Combiner is an optional optimization step in MapReduce that performs local aggregation on Mapper output before it is sent to the Reducer.

Introduction: By reducing the amount of intermediate data transferred across the network, the Combiner minimizes data shuffling overhead, improving overall performance and efficiency in Hadoop processing.
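The effect of a Combiner can be sketched by pre-aggregating one Mapper's word-count output locally before the shuffle (an illustrative example; summing counts is safe for a Combiner because addition is associative and commutative):

```python
from collections import Counter

def combine(pairs):
    """Combiner: locally aggregate (word, count) pairs on the Mapper's node."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return sorted(counts.items())

# One Mapper's raw output: five pairs would cross the network without a Combiner.
mapper_output = [("big", 1), ("data", 1), ("big", 1), ("big", 1), ("data", 1)]
combined = combine(mapper_output)
# Only two pairs are shuffled to the Reducer instead of five.
# combined == [("big", 3), ("data", 2)]
```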

Partitioner:
Definition: The Partitioner function determines how intermediate key-value pairs are distributed to Reducers in a MapReduce job.

Introduction: Partitioning ensures load balancing by assigning specific key ranges to different Reducers. This step improves efficiency by preventing data skew and ensuring even data distribution.
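Hadoop's default Partitioner hashes each key and takes the result modulo the number of Reducers. A minimal sketch of that idea (using an MD5-based hash here for deterministic behavior, rather than Hadoop's Java hashCode):

```python
import hashlib

def partition(key, num_reducers):
    """Assign a key to a reducer index by hashing it modulo the reducer count."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_reducers

# Every occurrence of the same key always lands on the same Reducer,
# so that Reducer sees the complete set of values for that key.
keys = ["big", "data", "hadoop", "cluster"]
assignments = {k: partition(k, 3) for k in keys}
```

The essential property is not which Reducer a key goes to, but that the mapping is deterministic and spreads distinct keys across all Reducers.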
NoSQL Databases:

Definition: NoSQL databases are non-relational databases designed for distributed data storage and high-performance operations. They handle structured, semi-structured, and unstructured data efficiently without requiring a fixed schema.

Introduction: Traditional SQL-based databases struggle with scalability and flexibility when dealing with large datasets. NoSQL databases were developed to overcome these limitations by offering a flexible schema, high availability, and horizontal scalability. These databases are widely used in real-time applications, big data analytics, and distributed environments.

Types of NoSQL Databases:


1. Key-Value Stores: Data is stored in key-value pairs, offering fast lookups and efficient retrieval. Examples: Redis, DynamoDB.

2. Document Stores: Data is stored in flexible document formats like JSON or BSON. Examples: MongoDB, CouchDB.

3. Column-Family Stores: Data is stored in column-oriented structures for fast retrieval and analysis. Examples: Cassandra, HBase.

4. Graph Databases: Data is represented as interconnected nodes and relationships, suitable for social networks and recommendation engines. Examples: Neo4j, ArangoDB.
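The first two categories can be contrasted using plain Python structures (illustrative only — not a real Redis or MongoDB client; all keys and field names are made up):

```python
# Key-value store: opaque values, lookup only by key (cf. Redis, DynamoDB).
kv_store = {"session:42": "user=alice;cart=3"}

# Document store: nested, schema-free documents queryable by field (cf. MongoDB).
doc_store = [
    {"_id": 1, "name": "alice", "cart": [{"sku": "A1", "qty": 2}]},
    {"_id": 2, "name": "bob"},  # no fixed schema: 'cart' may simply be absent
]

# Key-value access: fast lookup by key, but no querying inside the value.
session = kv_store["session:42"]

# Document access: filter by any field, tolerating documents that lack it.
with_carts = [d for d in doc_store if "cart" in d]
```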

Advantages of NoSQL:
 Scalability: Horizontally scalable to handle increasing data volumes.

 Schema Flexibility: No fixed schema, allowing for dynamic data models.

 High Availability: Built-in replication ensures data redundancy and fault tolerance.

 Faster Performance: Optimized for high-speed read/write operations.

 Support for Large Data Volumes: Handles big data applications efficiently.

Use of NoSQL in Industry:

NoSQL databases are widely used in various industries, including:

 E-commerce: Managing user sessions, shopping carts, and product catalogs.

 Social Media: Storing and analyzing user interactions, recommendations, and messages.

 Finance: Processing real-time transactions and fraud detection.

 Healthcare: Managing electronic health records and patient data.

 IoT Applications: Handling sensor data and real-time analytics.

SQL vs. NoSQL:

Feature        SQL (Relational Databases)      NoSQL (Non-Relational Databases)
Schema         Fixed schema                    Dynamic schema
Scalability    Vertical scaling                Horizontal scaling
Data Model     Tables with rows and columns    Key-value, document, column, graph
Transactions   ACID compliance                 BASE model (eventually consistent)
Performance    Optimized for transactions      Optimized for large-scale reads/writes

NewSQL:
Definition: NewSQL databases are modern relational database systems that combine the scalability of NoSQL with the strong consistency of traditional SQL databases.

Comparison of NoSQL, SQL, and NewSQL:

Feature        SQL       NoSQL                              NewSQL
Schema         Fixed     Flexible                           Fixed
Scalability    Limited   High                               High
Transactions   ACID      BASE                               ACID
Use Case       OLTP      Big data & real-time processing    Scalable SQL processing
