Hadoop and MapReduce Notes

Hadoop, developed in 2005, is an open-source framework for distributed storage and processing of large datasets, inspired by Google's GFS and MapReduce papers. Its core components are HDFS for storage, YARN for resource management, and MapReduce for data processing, with support for a variety of data formats and programming languages. The surrounding ecosystem includes tools such as Hive and Pig, supporting big data applications like log processing and data warehousing.

1. History of Hadoop

- Developed by Doug Cutting and Mike Cafarella in 2005.

- Inspired by Google's papers on the Google File System (GFS) and MapReduce.

- Became an Apache open-source project.

2. Apache Hadoop

- A framework for distributed storage and processing of large datasets.

- Open-source and widely adopted in big data applications.

3. Hadoop Distributed File System (HDFS)

- A distributed file system designed for high-throughput access to application data.

- Splits files into large blocks and stores them across multiple machines, replicating each block (three copies by default).

- Fault-tolerant and scalable.
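
A minimal sketch of everyday HDFS interaction from the command line (the /user/alice paths and the file name are placeholders); these hdfs dfs subcommands are part of the standard Hadoop CLI:

    hdfs dfs -mkdir -p /user/alice/input        # create a directory in HDFS
    hdfs dfs -put access.log /user/alice/input  # copy a local file into HDFS
    hdfs dfs -ls /user/alice/input              # list the directory
    hdfs dfs -cat /user/alice/input/access.log  # print a file stored in HDFS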

4. Components of Hadoop

- HDFS: Storage layer.

- YARN: Resource management and job scheduling.

- MapReduce: Processing engine.

- Common: Shared utilities and libraries.

5. Data Format in Hadoop

- Text, SequenceFile, Avro, Parquet, and ORC formats.

- Optimized for big data processing and compatibility.

6. Analyzing Data with Hadoop

- Data is processed using the MapReduce programming model.

- Useful for batch processing of massive datasets.


7. Scaling Out

- Hadoop scales horizontally by adding more nodes.

- Provides high availability and fault tolerance.

8. Hadoop Streaming

- Allows developers to write MapReduce jobs in any language that can read standard input and write standard output (e.g., Python, Perl).

- The mapper and reducer exchange key/value data with the framework as lines on stdin and stdout; a sample invocation follows.
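
A sketch of a Streaming invocation, assuming a standard distribution layout (the jar path varies by install, and the /user/alice paths are placeholders). Here the stock executables /bin/cat and /usr/bin/wc stand in as mapper and reducer purely for illustration:

    hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
        -input /user/alice/input \
        -output /user/alice/output \
        -mapper /bin/cat \
        -reducer /usr/bin/wc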

9. Hadoop Pipes

- A C++ API for writing MapReduce programs; the C++ process communicates with Hadoop over a socket rather than via standard streams.

- Offers performance benefits over Streaming for C++ developers.

10. Hadoop Ecosystem

- Includes Hive, Pig, HBase, Sqoop, Flume, Oozie, ZooKeeper, Mahout, and others.

- Provides a complete big data solution stack.

11. MapReduce Framework and Basics

- Programming model for processing large datasets in parallel.

- Consists of Map and Reduce functions.

12. How MapReduce Works

- Map function processes input key/value pairs to generate intermediate pairs.

- Shuffle and sort stage groups intermediate data.

- Reduce function processes grouped data to generate output.
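
A minimal sketch of both functions using the classic word-count example (class names are illustrative, and the two classes would normally live in separate files). The mapper emits a (word, 1) pair per word; after shuffle and sort, the reducer receives each word with all of its 1s grouped together and sums them:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map: for each input line, emit (word, 1) for every word.
    public class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the grouped counts for one word and emit the total.
    class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }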

13. Developing a MapReduce Application

- Define Mapper and Reducer classes.

- Set job configuration and input/output paths.
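
A driver sketch for the word-count classes above (names follow that sketch; input and output paths are taken from the command line). waitForCompletion submits the job and blocks until it finishes:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);      // tells Hadoop which jar to ship
            job.setMapperClass(WordCountMapper.class);
            job.setCombinerClass(WordCountReducer.class);  // optional map-side pre-aggregation
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not exist yet
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }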

14. Unit Tests with MRUnit

- MRUnit is a Java library to test MapReduce applications.

- Helps validate logic without full cluster setup.
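
A sketch of an MRUnit test for the word-count mapper above; MapDriver runs the mapper in memory, so no cluster or HDFS is involved. Expected outputs must be declared in the order the mapper emits them:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mrunit.mapreduce.MapDriver;
    import org.junit.Test;

    public class WordCountMapperTest {
        @Test
        public void emitsOnePairPerWord() throws Exception {
            MapDriver<LongWritable, Text, Text, IntWritable> driver =
                MapDriver.newMapDriver(new WordCountMapper());
            driver.withInput(new LongWritable(0), new Text("hello world hello"))
                  .withOutput(new Text("hello"), new IntWritable(1))  // emitted in input order
                  .withOutput(new Text("world"), new IntWritable(1))
                  .withOutput(new Text("hello"), new IntWritable(1))
                  .runTest();
        }
    }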

15. Test Data and Local Tests


- Use sample data and run jobs in local mode for testing.

- Ensures correctness before deployment.
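
A sketch of forcing local execution for a test run; both property names are standard Hadoop 2.x configuration keys, and the sample paths are placeholders. The job runs in a single JVM against the local filesystem, which makes debugging in an IDE straightforward:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountLocalTest {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("mapreduce.framework.name", "local"); // one local JVM, no YARN
            conf.set("fs.defaultFS", "file:///");          // local filesystem, no HDFS
            Job job = Job.getInstance(conf, "word count (local)");
            job.setJarByClass(WordCountLocalTest.class);
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path("sample-input"));
            FileOutputFormat.setOutputPath(job, new Path("out-local")); // must not exist
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }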

16. Anatomy of a MapReduce Job Run

- Job submission -> Job initialization -> Task assignment -> Map phase -> Shuffle & sort -> Reduce phase -> Output.

17. Failures in MapReduce

- Handled by re-running failed tasks on other nodes.

- Speculative execution mitigates stragglers by running duplicate attempts of slow tasks and taking whichever finishes first (see the configuration sketch below).
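
A fragment sketching per-job control of speculative execution (these are the standard Hadoop 2.x property names; both default to enabled). It belongs in the driver's main() before the job is created:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    // Inside the driver, before Job.getInstance(...):
    Configuration conf = new Configuration();
    conf.setBoolean("mapreduce.map.speculative", true);     // duplicate slow map attempts
    conf.setBoolean("mapreduce.reduce.speculative", false); // e.g., when reduce output is expensive
    Job job = Job.getInstance(conf, "job with tuned speculation");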

18. Job Scheduling

- YARN handles resource allocation.

- FIFO, Capacity, and Fair schedulers are available; a Fair Scheduler configuration sketch follows.
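
A yarn-site.xml sketch selecting the Fair Scheduler (the property name and class are standard; many distributions ship with the Capacity Scheduler as the default):

    <property>
      <name>yarn.resourcemanager.scheduler.class</name>
      <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
    </property>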

19. Shuffle and Sort

- Intermediate data is shuffled across nodes and sorted by keys.

- Critical for correct and efficient reduce phase.
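
A sketch of a custom Partitioner, the hook that decides which reducer receives each key during the shuffle (the default HashPartitioner does essentially this):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // All pairs with the same key land in the same partition, hence the
    // same reducer; within a partition, keys arrive sorted.
    public class WordPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            // Mask the sign bit so the bucket index is non-negative.
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

It would be registered in the driver with job.setPartitionerClass(WordPartitioner.class); the number of partitions equals the number of reduce tasks set via job.setNumReduceTasks(n).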

20. Task Execution

- Mappers and reducers run in containers managed by YARN.

- Monitoring is available via the ResourceManager web UI.

21. MapReduce Types

- Built-in variants include IdentityMapper, ChainMapper, and CompositeInputFormat.

- Custom types for complex logic.

22. Input Formats

- TextInputFormat (the default), KeyValueTextInputFormat, SequenceFileInputFormat, and others.

23. Output Formats

- TextOutputFormat (the default), SequenceFileOutputFormat, and MultipleOutputs for writing several named outputs; a format-selection sketch follows.
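
A fragment sketching format selection in the driver from section 13 (note that with KeyValueTextInputFormat the mapper's input key type must be Text rather than LongWritable):

    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    // After creating the Job, override the default Text formats:
    job.setInputFormatClass(KeyValueTextInputFormat.class);   // tab-separated key/value lines
    job.setOutputFormatClass(SequenceFileOutputFormat.class); // binary key/value output files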

24. MapReduce Features

- Scalability, fault tolerance, simplicity, and cost-effectiveness.


25. Real-world MapReduce

- Used in log processing, indexing, data warehousing, and sentiment analysis.
