Big Data Notes
Many large organizations, e.g. Netflix, eBay, etc., are using Hadoop to deal with big data. The Hadoop architecture mainly consists of 4 components:

* MapReduce
* HDFS (Hadoop Distributed File System)
* YARN (Yet Another Resource Negotiator)
* Common Utilities or Hadoop Common

Let's understand the role of each of these components in detail.

1. MapReduce

MapReduce is nothing but an algorithm, or a data structure, that is based on the YARN framework. The major feature of MapReduce is that it performs the distributed processing in parallel in a Hadoop cluster, which is what makes Hadoop work so fast; when you are dealing with big data, serial processing is no longer of any use. MapReduce has mainly 2 tasks, which are divided phase-wise: in the first phase Map is used, and in the next phase Reduce is used.

The inputs are provided to the Map() function, its outputs are used as the input to the Reduce() function, and after that we receive the final output. Let's understand what Map() and Reduce() do. The input provided to Map() is a set of data. Map() breaks these data blocks into tuples, which are nothing but key-value pairs. These key-value pairs are then sent as input to Reduce(). The Reduce() function combines the tuples based on their key value to form a set of tuples, performs some operation on them such as sorting or a summation-type job, and sends the result to the final output node, where the output is obtained. How the tuples are combined is always decided in the Reducer, depending on the business requirement. This is how first Map() and then Reduce() are used one after the other. Let's understand the Map task and the Reduce task in detail.

Map Task:
* RecordReader: The purpose of the RecordReader is to break the input into records and to provide key-value pairs to the Map() function. The key is the record's locational information and the value is the data associated with it.
* Map: A map is nothing but a user-defined function whose work is to process the tuples obtained from the RecordReader. For a given input, the Map() function may generate no key-value pair at all, or it may generate multiple pairs.
* Combiner: The Combiner is used for grouping the data in the Map workflow. It is similar to a local reducer: the intermediate key-value pairs generated by the Map are combined with the help of this Combiner. Using a Combiner is optional.
* Partitioner: The Partitioner is responsible for fetching the key-value pairs generated in the Mapper phase and generating the shards (partitions) corresponding to each reducer. It fetches the hash code of each key and computes its modulus with the number of reducers: key.hashCode() % (number of reducers).

Reduce Task:
* Shuffle and Sort: The Reducer's task starts with this step. The process in which the Mapper generates intermediate key-value pairs and transfers them to the Reducer is known as shuffling. Using the shuffle process the system can sort the data by key. Shuffling begins as soon as some of the map tasks are done; it does not wait for the Mapper to finish completely, which makes it a faster process.
* Reduce: The main task of Reduce is to gather the tuples generated by Map and perform some sorting and aggregation on those key-value pairs, depending on the key element.
* OutputFormat: Once all the operations are performed, the key-value pairs are written into a file with the help of the RecordWriter, each record on a new line and the key and value separated by whitespace.

[Figure: MapReduce data flow — input (key, value) pairs (K1 = Key1, ..., Kk = KeyK, Vk = ValueK) passing from Map() to Reduce()]
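As a concrete illustration of the Map and Reduce tasks described above, here is a minimal word-count sketch using the standard Hadoop MapReduce Java API. The class names are illustrative, not part of these notes. The combiner mentioned above is the optional "local reducer", and the default partitioner (HashPartitioner) applies essentially the key.hashCode() % (number of reducers) rule. A driver that wires these classes together and submits the job to YARN is sketched at the end of these notes.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map task: the RecordReader hands each line to map() as a
    // (byte-offset, line-text) pair -- the "locational information" key
    // described above. map() emits (word, 1) tuples.
    public class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);   // zero, one, or many pairs per record
            }
        }
    }

    // Reduce task: after shuffle and sort, all values for one key arrive
    // together; reduce() aggregates them (a summation-type job) and the
    // result is written out by the RecordWriter.
    class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

If the reducer class is also registered as the combiner, partial sums are computed on the map side before the shuffle, which reduces the amount of data moved across the network.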
2. HDFS

HDFS (Hadoop Distributed File System) is the storage layer of a Hadoop cluster. It is mainly designed to run on commodity hardware (inexpensive devices) with a distributed file system design, and it prefers to store data in a few large blocks rather than many small ones. HDFS provides fault tolerance and high availability to the storage layer and to the other devices present in the Hadoop cluster.

Data storage nodes in HDFS:

* NameNode: The NameNode works as the master of a Hadoop cluster and guides the DataNodes (the slaves). It stores the file system metadata: file names, file sizes, and the location information (block numbers, block IDs) of the DataNodes, which the NameNode uses to find the closest DataNode for faster communication. The NameNode instructs the DataNodes to perform operations such as delete, create, and replicate.
* DataNode: DataNodes work as slaves and are mainly used for storing the data in a Hadoop cluster; the number of DataNodes can range from 1 to 500 or even more. The more DataNodes a cluster has, the more data it can store, so DataNodes should have a high storage capacity to hold a large number of file blocks.

[Figure: High-Level Architecture of Hadoop]

File blocks in HDFS: Data in HDFS is always stored in terms of blocks. A file is split into blocks of 128 MB by default, and this size can be changed. Let's understand this with an example: suppose you upload a 400 MB file to HDFS; it is divided into blocks of 128 MB + 128 MB + 128 MB + 16 MB = 400 MB, i.e. four blocks are created, all of 128 MB except the last one. Hadoop does not know or care what data is stored in these blocks, so it simply treats the final, smaller block as a partial record. In the Linux file system a file block is about 4 KB, far smaller than the default block size in the Hadoop file system. Hadoop is configured for storing data at petabyte scale, which is what makes its file system different from other file systems and lets it scale; nowadays block sizes of 128 MB to 256 MB are used in Hadoop.

Replication in HDFS: Replication ensures the availability of the data. Replication means making a copy of something, and the number of copies you make of that particular thing is its replication factor. As we saw with file blocks, HDFS stores data as blocks, and Hadoop is also configured to make copies of those blocks. By default the replication factor is 3, and it can be changed manually as per your requirement. In the example above we created 4 file blocks, so with 3 replicas (copies) of each block a total of 4 x 3 = 12 block copies are stored for backup purposes. This is because Hadoop runs on commodity hardware (inexpensive system hardware), which can crash at any time; we are not using supercomputers for our Hadoop setup. That is why HDFS needs a feature that keeps copies of the file blocks for backup, and this is known as fault tolerance. Note that by keeping so many replicas of our file blocks we do use a lot of extra storage, but for large organizations the data is far more important than the storage, so nobody minds this overhead. You can configure the replication factor in your hdfs-site.xml file.
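To make the block and replication story concrete, here is a small client sketch using the Hadoop FileSystem API. The file path is hypothetical, and the sketch assumes a running HDFS cluster whose configuration is on the classpath; the defaults discussed above correspond to the hdfs-site.xml properties dfs.blocksize (134217728 bytes = 128 MB) and dfs.replication (3). The sketch asks the NameNode for a file's block metadata and prints where each block's replicas live.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlocks {
        public static void main(String[] args) throws Exception {
            // Reads core-site.xml / hdfs-site.xml from the classpath;
            // dfs.blocksize and dfs.replication supply the defaults described above.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/data/sample_400mb.bin");   // hypothetical 400 MB file
            FileStatus status = fs.getFileStatus(file);
            System.out.println("block size  : " + status.getBlockSize());
            System.out.println("replication : " + status.getReplication());

            // The NameNode holds this metadata: block offsets, lengths,
            // and the DataNodes that hold each replica.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset=" + block.getOffset()
                        + " length=" + block.getLength()
                        + " replicas on: " + String.join(", ", block.getHosts()));
            }
        }
    }

For a 400 MB file with the defaults above, this would list four blocks (three of 128 MB and one of 16 MB), each reported on three DataNode hosts.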
Rack Awareness: A rack is nothing but the physical collection of nodes in our Hadoop cluster (maybe 30 to 40 nodes). A large Hadoop cluster consists of many racks, and with the help of this rack information the NameNode chooses the closest DataNode, which maximizes read/write performance and reduces network traffic.

[Figure: HDFS Architecture]

3. YARN (Yet Another Resource Negotiator)

YARN is the framework on which MapReduce works. YARN performs two operations: job scheduling and resource management. The purpose of the job scheduler is to divide a big task into small jobs so that each job can be assigned to the various slaves in the Hadoop cluster and processing can be maximized. The job scheduler also keeps track of which job is important, which job has higher priority, the dependencies between jobs, and other information such as job timing. The resource manager manages all the resources made available for running the Hadoop cluster.

Features of YARN:
* Multi-tenancy
* Scalability
* Cluster utilization
* Compatibility

4. Hadoop Common or Common Utilities

Hadoop Common, or the common utilities, is nothing but the Java library and Java files that all the other components in a Hadoop cluster need; these utilities are used by HDFS, YARN, and MapReduce to run the cluster. Hadoop Common also reflects the assumption that hardware failure in a Hadoop cluster is common, so it needs to be handled automatically in software by the Hadoop framework.
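Tying the pieces together: the sketch below is a minimal driver for the word-count classes shown earlier, using the standard Hadoop Job API. The input and output paths are illustrative. When the job is submitted, YARN performs the resource management and job scheduling described above, while HDFS supplies the input blocks and stores the output.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCountDriver.class);

            job.setMapperClass(WordCountMapper.class);
            job.setCombinerClass(WordCountReducer.class);  // optional local reducer
            job.setReducerClass(WordCountReducer.class);

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path("/data/input"));     // HDFS input
            FileOutputFormat.setOutputPath(job, new Path("/data/output"));  // must not exist yet

            // Submits the job to YARN and waits; the ResourceManager allocates
            // containers and the scheduler runs the map and reduce tasks.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }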