BDA Unit III
Definition: Hadoop is an open-source framework designed for storing and processing large volumes of data in a distributed computing environment. It enables scalable and efficient data handling by utilizing clusters of commodity hardware. The Hadoop framework allows parallel processing and fault tolerance, making it a powerful tool for managing big data.
Introduction: In today's digital world, the volume of data generated is enormous, requiring systems that can handle, store, and analyze large datasets efficiently. Traditional database management systems (DBMS) struggle with scalability and performance when dealing with massive amounts of data. Hadoop was developed as a solution to these challenges, providing a distributed computing model that processes large datasets across multiple nodes simultaneously. By leveraging the Hadoop Distributed File System (HDFS) and the MapReduce processing paradigm, Hadoop ensures high availability, fault tolerance, and efficient handling of structured and unstructured data.
Hadoop Processing Model:
Introduction: Hadoop provides an efficient way to process large amounts of data by dividing tasks into smaller sub-tasks and executing them in parallel across a distributed system. The Hadoop processing model includes data ingestion, storage, processing, and retrieval. The MapReduce programming model plays a crucial role in processing data efficiently by breaking it down into two primary phases: the Map phase and the Reduce phase.
MapReduce:
Introduction: The MapReduce framework enables efficient parallel processing of big data by distributing tasks across multiple computing nodes. It consists of two core functions:
Map Function: Processes input data and generates intermediate key-value pairs.
Reduce Function: Aggregates and processes the intermediate results to produce the final output.
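The two phases above can be sketched as a small single-process simulation in Python (real Hadoop jobs are typically written against the Java MapReduce API; the function names here are illustrative). The classic word-count job shows the flow: Map emits (word, 1) pairs, a shuffle step groups pairs by key, and Reduce sums the values for each key.

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit an intermediate (word, 1) pair for every token in the line.
    return [(word, 1) for word in line.split()]

def reduce_phase(key, values):
    # Reduce: aggregate all values collected for one key into a final count.
    return (key, sum(values))

def run_job(lines):
    # Shuffle/sort: group intermediate pairs by key before reducing,
    # mimicking what the framework does between the two phases.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_phase(line):
            groups[key].append(value)
    return dict(reduce_phase(k, v) for k, v in sorted(groups.items()))

counts = run_job(["big data big insights", "big data"])
print(counts)  # {'big': 3, 'data': 2, 'insights': 1}
```

In a real cluster, many Mappers and Reducers run this same logic concurrently on different splits of the input.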
Mapper:
Definition: The Mapper function is the first phase of the MapReduce framework, responsible for processing input data and transforming it into key-value pairs.
Introduction: Each Mapper processes a portion of the input dataset independently and generates key-value pairs as output. These intermediate key-value pairs are later sorted and passed to the Reducer. The Mapper function is highly parallelizable, allowing multiple Mappers to process data simultaneously for efficiency.
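A Mapper can be sketched in the style of a Hadoop Streaming mapper, where key-value pairs are plain tab-separated text records (this is a Python sketch of the Streaming contract, not the native Java Mapper class; in a real job the input would come from sys.stdin):

```python
def mapper(lines):
    # Hadoop Streaming contract: emit "key<TAB>value" text records.
    # Here every token becomes a key with the value 1.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

records = list(mapper(["hadoop stores hadoop"]))
print(records)  # ['hadoop\t1', 'stores\t1', 'hadoop\t1']
```

Note that the Mapper itself does no aggregation; duplicate keys are expected and are resolved later by the Reducer.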
Reducer:
Definition: The Reducer function is the second phase of MapReduce, which processes intermediate key-value pairs generated by the Mappers and consolidates them into final results.
Introduction: After sorting and grouping the intermediate results, the Reducer applies aggregation, computation, or transformation operations to generate the final output. This phase is responsible for reducing large amounts of data into meaningful insights.
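A matching Streaming-style Reducer sketch (again illustrative Python, not the Java Reducer class): because the framework delivers records sorted by key, all values for one key arrive adjacent to each other, so the Reducer can aggregate with a single grouped pass.

```python
from itertools import groupby

def reducer(records):
    # Records arrive sorted by key after the shuffle, so all values for
    # one key are adjacent; group them and emit the aggregated total.
    parsed = (r.split("\t", 1) for r in records)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(v) for _, v in group)}"

sorted_records = ["data\t1", "hadoop\t1", "hadoop\t1"]
print(list(reducer(sorted_records)))  # ['data\t1', 'hadoop\t2']
```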
Combiner:
Definition: The Combiner is an optional optimization step in MapReduce that performs local aggregation on Mapper output before it is sent to the Reducer.
Introduction: By reducing the amount of intermediate data transferred across the network, the Combiner minimizes data shuffling overhead, improving overall performance and efficiency in Hadoop processing.
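The savings can be seen in a small sketch (assumed Python simulation; in Hadoop the Combiner is usually the same class as the Reducer, run locally on each Mapper node): four intermediate records shrink to two before anything crosses the network.

```python
from collections import Counter

def combiner(pairs):
    # Local aggregation: sum values per key on the mapper node so fewer
    # records have to be shuffled to the Reducers.
    combined = Counter()
    for key, value in pairs:
        combined[key] += value
    return sorted(combined.items())

raw = [("big", 1), ("data", 1), ("big", 1), ("big", 1)]
print(combiner(raw))  # [('big', 3), ('data', 1)] -- 4 records shrink to 2
```

A Combiner is only safe for operations that are commutative and associative (such as sum or max), since the framework may run it zero, one, or many times.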
Partitioner:
Definition: The Partitioner function determines how intermediate key-value pairs are distributed to Reducers in a MapReduce job.
Introduction: Partitioning ensures load balancing by assigning specific key ranges to different Reducers. This step improves efficiency by preventing data skew and ensuring even data distribution.
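Hadoop's default Partitioner hashes the key modulo the number of Reducers; a Python sketch of that idea (using crc32 as a stand-in hash, since Java's hashCode is not available here) looks like:

```python
import zlib

def partition(key, num_reducers):
    # Hash-style partitioner: the same key always lands on the same
    # reducer, while distinct keys spread across all reducers.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

# Every occurrence of a key is routed consistently, so one Reducer
# sees all values for that key.
assignments = {k: partition(k, 3) for k in ["apple", "banana", "cherry"]}
print(assignments)
```

The determinism is the important property: all values for one key must reach the same Reducer, or the final aggregation would be incomplete.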
NoSQL Databases:
Definition: NoSQL databases are non-relational databases designed for distributed data storage and high-performance operations. They handle structured, semi-structured, and unstructured data efficiently without requiring a fixed schema.
Introduction: Traditional SQL-based databases struggle with scalability and flexibility when dealing with large datasets. NoSQL databases were developed to overcome these limitations by offering a flexible schema, high availability, and horizontal scalability. These databases are widely used in real-time applications, big data analytics, and distributed environments.
Types of NoSQL Databases:
1. Key-Value Stores: Data is stored as simple key-value pairs for fast lookups. Examples: Redis, DynamoDB.
2. Document Stores: Data is stored in flexible document formats like JSON or BSON. Examples: MongoDB, CouchDB.
3. Column-Family Stores: Data is stored in column-oriented structures for fast retrieval and
analysis. Examples: Cassandra, HBase.
4. Graph Databases: Data is represented as interconnected nodes and relationships, suitable for social networks and recommendation engines. Examples: Neo4j, ArangoDB.
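The schema flexibility of a document store can be illustrated with a minimal in-memory sketch (plain Python dictionaries standing in for a real store such as MongoDB or CouchDB; the function names are illustrative): two documents in the same collection can carry different fields with no schema change.

```python
# Each record is a schemaless JSON-like dict keyed by an id,
# as in a document store such as MongoDB or CouchDB.
store = {}

def insert(doc_id, doc):
    # Store the document as-is; no fixed schema is enforced.
    store[doc_id] = doc

def find(predicate):
    # Query by arbitrary predicate over document contents.
    return [d for d in store.values() if predicate(d)]

insert("u1", {"name": "Asha", "followers": 120})
insert("u2", {"name": "Ravi", "city": "Pune"})  # different fields: no fixed schema
print(find(lambda d: "city" in d))  # [{'name': 'Ravi', 'city': 'Pune'}]
```

A relational table would require both rows to share one column set; here each document carries only the fields it needs.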
Advantages of NoSQL:
Scalability: Horizontally scalable to handle increasing data volumes.
High Availability: Built-in replication ensures data redundancy and fault tolerance.
Support for Large Data Volumes: Handles big data applications efficiently.
Use Cases of NoSQL:
Social Media: Storing and analyzing user interactions, recommendations, and messages.
Data Model: SQL databases use tables with rows and columns, while NoSQL databases use key-value, document, column, or graph models.
NewSQL:
Definition: NewSQL databases are modern relational database systems that combine the scalability of NoSQL with the strong consistency of traditional SQL databases.
Use Case: SQL databases target OLTP workloads, NoSQL databases target big data and real-time processing, and NewSQL databases target scalable SQL processing.