Unit V
YARN
Introduction to YARN
YARN (Yet Another Resource Negotiator) is the resource management layer of the Hadoop ecosystem. It decouples resource management from data processing, allowing multiple processing engines to share a single cluster.
Components of YARN
1. ResourceManager (RM)
o Responsibilities: Manages and allocates resources across the cluster. It is the central authority for resource allocation in YARN.
2. NodeManager (NM)
o Responsibilities: Runs on every node, manages that node's resources, and reports usage to the ResourceManager.
3. ApplicationMaster (AM)
o Responsibilities: Each application has its own ApplicationMaster that negotiates resources with the ResourceManager and coordinates task execution with the NodeManagers.
4. Containers
o Responsibilities: Containers encapsulate a collection of resources like CPU, memory, and storage, and they are used by applications to execute tasks.
Challenges of YARN
1. Complexity: The architecture of YARN is more complex than the original Hadoop MapReduce, requiring more sophisticated management and troubleshooting.
2. Resource Contention: Properly configuring and tuning resource allocation policies can be challenging to prevent contention and ensure fair resource distribution.
3. Security: Ensuring secure communication and resource allocation between various components of YARN (ResourceManager, NodeManager, ApplicationMaster) is essential.
4. Fault Tolerance: Handling failures efficiently and ensuring that applications can recover gracefully is a critical aspect of managing a YARN cluster.
5. Monitoring and Debugging: Comprehensive monitoring and debugging tools are necessary to manage large, dynamic, and diverse workloads effectively.
YARN revolutionizes resource management and job scheduling in the Hadoop ecosystem by providing a flexible, scalable, and efficient framework. Its components work together to ensure that resources are utilized effectively across various types of data processing workloads. While YARN addresses many limitations of the earlier Hadoop architecture, it also introduces new challenges related to complexity, resource management, security, fault tolerance, and monitoring, which must be managed to fully leverage its capabilities.
Architecture of YARN
Components of YARN
1. ResourceManager (RM)
The ResourceManager is the master daemon that arbitrates resources among all applications. It has two main components:
● Scheduler
o Allocates resources to running applications subject to constraints such as capacities, queues, etc. (a configuration sketch follows the component list below).
o Uses different policies (FIFO, Capacity Scheduler, Fair Scheduler) to manage resource distribution.
● ApplicationManager (ASM)
o Manages the lifecycle of applications, from job submission to completion.
o Restarts the ApplicationMaster container in case of failure.
2. NodeManager (NM)
The NodeManager runs on each node in the cluster and is responsible for managing the node's resources. It monitors resource usage (CPU, memory, disk, network) and reports to the ResourceManager. Key responsibilities include:
● Container Management
o Launches and monitors containers as instructed by the ApplicationMaster.
● Resource Monitoring
o Tracks resource usage by containers and reports it to the ResourceManager.
3. ApplicationMaster (AM)
Each application has its own ApplicationMaster, which is a framework-specific entity responsible for negotiating resources with the ResourceManager and working with the NodeManagers to execute and monitor tasks.
● Resource Negotiation
o Requests resources from the ResourceManager based on the application's requirements.
● Task Execution
o Assigns tasks to containers and monitors their execution.
● Fault Tolerance
o Handles task failures and ensures job completion.
4. Containers
Containers are the basic units of resource allocation in YARN, bundling CPU, memory, and other resources on a single node.
● Task Execution
o Tasks run within containers, which provide the necessary runtime environment.
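The scheduler policy mentioned under the ResourceManager above is chosen through the cluster's yarn-site.xml. Below is a minimal sketch of such a configuration; the property names are standard Hadoop settings, but the values (including the 8192 MB figure) are illustrative and should be checked against your Hadoop version:
xml
<configuration>
  <!-- Select the Capacity Scheduler as the ResourceManager's scheduling policy -->
  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
  </property>
  <!-- Total memory (in MB) this NodeManager may hand out to containers -->
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>8192</value>
  </property>
</configuration>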
Application Workflow in YARN
1. Job Submission
o A client submits an application to the ResourceManager, including the information needed to launch the ApplicationMaster.
2. ApplicationMaster Initialization
o The ResourceManager allocates a container for the ApplicationMaster and launches it on an available NodeManager.
3. Resource Negotiation
o The ApplicationMaster negotiates with the ResourceManager to request containers for executing tasks.
4. Task Execution
o Containers are allocated by the ResourceManager and launched by the NodeManagers, which run the application's tasks.
Benefits and Challenges of YARN
1. Flexibility
o Supports a wide range of processing models (batch, interactive, streaming), allocating resources according to what each application needs.
2. Scalability
o Designed to handle large-scale clusters, YARN scales horizontally, supporting thousands of nodes and efficient resource distribution.
3. Security
o Ensuring secure communication and resource allocation between the ResourceManager, NodeManagers, and ApplicationMasters remains a significant challenge.
4. Fault Tolerance
o Handling failures efficiently and ensuring that applications recover gracefully is critical in a YARN cluster.
5. Monitoring and Debugging
o Comprehensive tools are necessary to monitor and debug large, dynamic workloads effectively.
Applications of Hadoop MapReduce
1. Log Analysis
o Log Processing: Analyzing large volumes of server and application logs to extract usage and performance insights.
2. Search Indexing
o Document Indexing: Building searchable indexes over large document collections, enabling fast retrieval of large datasets.
o Inverted Indexing: Generating an index where each term points to its occurrences in the document collection.
3. Recommendation Systems
o Collaborative Filtering: Generating recommendations based on user behavior and preferences.
4. Data Warehousing
o ETL (Extract, Transform, Load): Extracting data from various sources, transforming it into the required format, and loading it into a data warehouse.
5. Machine Learning
o Model Training: Running distributed algorithms such as clustering and classification on large datasets.
o Data Preprocessing: Cleaning and preparing data for machine learning models, including normalization and feature extraction.
6. Text Processing
o Word Count: The classic MapReduce example, in which the frequency of each word in a large body of text is counted.
7. Genomics and Bioinformatics
o Sequence Alignment: Aligning DNA sequences to identify similarities and differences across large genomic datasets.
8. Fraud Detection
o Anomaly Detection: Scanning large volumes of transaction data to flag unusual patterns and behaviors.
9. Social Network Analysis
o Graph Processing: Analyzing relationships and interactions in social networks to identify influential users and communities within the social graph.
10. Web Data Processing
o Web Crawling and Indexing: Crawling the web to collect data and creating indexes for search engines.
Advantages of Hadoop MapReduce
1. Scalability
o Easily scales to process massive datasets by distributing work across the nodes in a cluster.
2. Fault Tolerance
o Automatically handles failures by re-executing failed tasks on different nodes, ensuring the job completes successfully.
4. Flexibility
o Supports a wide range of applications and use cases across different domains.
Hadoop MapReduce is a powerful framework for processing large-scale data across various applications, from log analysis and data warehousing to machine learning and genomic research. Despite its complexities and limitations, its ability to scale and handle massive datasets makes it an essential tool in the big data ecosystem. Effective use of MapReduce requires understanding its architecture, strengths, and challenges, enabling organizations to leverage its capabilities for efficient and reliable data processing.
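To make the Word Count example above concrete, here is a minimal sketch of the classic job written for Hadoop Streaming, which lets MapReduce tasks run as ordinary Python scripts reading stdin and writing stdout. The file names mapper.py and reducer.py are illustrative, and Python 3 is assumed to be available on the cluster nodes:
python
# mapper.py -- emit "word<TAB>1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py -- sum the counts for each word; Hadoop sorts mapper output
# by key, so all lines for a given word arrive consecutively
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, n = line.rstrip("\n").split("\t")
    if word == current_word:
        count += int(n)
    else:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, int(n)
if current_word is not None:
    print(f"{current_word}\t{count}")

The two scripts would be passed as the -mapper and -reducer of a job submitted through the Hadoop Streaming jar shipped with the cluster (the exact jar path varies by installation).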
Data Serialization
Data serialization is a crucial aspect of data analysis, as it involves converting data into a format that can be easily stored, transferred, and reconstructed. Here's a detailed look at data serialization in the context of data analysis.
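As a concrete illustration of the store/transfer/reconstruct cycle, here is a minimal Python sketch using the standard json module; the record shown is made up for illustration:
python
import json

record = {"name": "John Doe", "age": 30, "courses": ["Math", "Science"]}

encoded = json.dumps(record)   # serialize: Python object -> JSON text
decoded = json.loads(encoded)  # deserialize: JSON text -> Python object

assert decoded == record       # the data survives the round trip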
Common Serialization Formats
1. CSV (Comma-Separated Values)
o Advantages: Simple, human-readable, and supported by virtually every tool.
o Disadvantages: Inefficient for large datasets, lacks support for complex data types and schema.
o Use cases: Quick data exchange, initial data exploration, and data sharing with non-technical stakeholders.
2. JSON (JavaScript Object Notation)
o Advantages: Human-readable, supports nested data structures.
o Disadvantages: Less efficient than binary formats in storage size and parsing speed.
o Use cases: Web APIs, configuration files, and general data interchange.
3. Parquet
o Advantages: Columnar storage format, efficient for analytical queries, supports compression.
o Use cases: Big data analytics, data warehousing, and any application requiring efficient
read-heavy operations.
4. Avro
o Advantages: Schema-based, compact binary format, supports schema evolution.
o Use cases: Data serialization for big data pipelines, data exchange between systems.
5. Feather
o Advantages: Fast read/write, designed for use with Python and R.
o Disadvantages: Limited support for complex data types compared to Parquet or Avro.
o Use cases: Fast data frame exchange between Python and R.
6. Pickle
o Advantages: Native Python serialization, convenient and fast for Python-specific operations.
o Disadvantages: Not as widely supported for interoperability as other formats.
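A hedged sketch of how several of these formats look in practice with pandas; to_parquet and to_feather assume the pyarrow package is installed, and the tiny DataFrame is purely illustrative:
python
import pandas as pd

df = pd.DataFrame({
    "name": ["John Doe", "Jane Roe"],
    "age": [30, 25],
})

df.to_csv("people.csv", index=False)   # text, universally readable
df.to_json("people.json")              # text, supports nested structures
df.to_parquet("people.parquet")        # binary, columnar, compressed
df.to_feather("people.feather")        # binary, fast Python/R interchange

print(pd.read_parquet("people.parquet"))  # round-trip check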
Serialization formats are essential in Big Data for efficient storage, transmission, and processing of data. Common serialization formats include JSON, XML, Avro, Parquet, and ORC. Each format has its own strengths and is suitable for specific use cases. Here are detailed notes on these common serialization formats:
1. JSON (JavaScript Object Notation)
● Description: A lightweight, text-based format for representing structured data.
● Strengths:
o Human-readable and widely supported across languages and platforms.
● Weaknesses:
o Not as efficient in terms of storage size and read/write performance compared to binary formats.
o No built-in schema support for data validation.
● Use Cases:
o Web APIs.
o Configuration files.
● Example:
json
{
  "name": "John Doe",
  "age": 30,
  "isStudent": false,
  "courses": ["Math", "Science"]
}
2. XML (eXtensible Markup Language)
● Description: A markup language that encodes documents in a format readable by both humans and machines.
● Strengths:
o Human-readable and self-describing.
o Schema support via DTD and XSD.
● Weaknesses:
o Verbose and can lead to large file sizes.
● Use Cases:
o Document storage and exchange (e.g., technical documentation, office documents).
● Example:
xml
<person>
<name>John Doe</name>
<age>30</age>
<isStudent>false</isStudent>
<courses>
<course>Math</course>
<course>Science</course>
</courses>
</person>
3. Avro
● Description: A row-oriented remote procedure call and data serialization framework developed within the Apache Hadoop project.
● Strengths:
o Compact binary format leading to efficient storage and processing.
o Schema-based, with support for schema evolution.
● Weaknesses:
o Less human-readable due to its binary nature.
● Use Cases:
o Data serialization for Hadoop and other Big Data frameworks (a usage sketch follows the schema below).
● Example Schema:
json
{
  "type": "record",
  "name": "Person",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"},
    {"name": "isStudent", "type": "boolean"},
    {"name": "courses", "type": {"type": "array", "items": "string"}}
  ]
}
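A minimal sketch of writing and reading records against the Person schema above, assuming the third-party fastavro package; the file name people.avro is illustrative:
python
from fastavro import parse_schema, reader, writer

schema = {
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
        {"name": "isStudent", "type": "boolean"},
        {"name": "courses", "type": {"type": "array", "items": "string"}},
    ],
}
parsed = parse_schema(schema)

records = [{"name": "John Doe", "age": 30, "isStudent": False,
            "courses": ["Math", "Science"]}]

with open("people.avro", "wb") as out:   # the schema is embedded in the file
    writer(out, parsed, records)

with open("people.avro", "rb") as f:     # so it can be read back without it
    for rec in reader(f):
        print(rec)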
4. Parquet
● Description: A columnar storage file format optimized for use with Big Data processing frameworks.
● Strengths:
o Efficient in terms of storage space and read/write performance for large datasets.
● Weaknesses:
o Less suitable for row-based operations.
● Use Cases:
o Data warehousing and analytics.
● Example:
o Parquet files are binary and not typically shown as text, but they can be created and read using tools such as Apache Spark or pandas (see the sketch below).
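For instance, a minimal sketch with the pyarrow package (an assumption; Spark or Hive would work equally well) that also shows the columnar benefit of reading only the columns a query needs:
python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"name": ["John Doe", "Jane Roe"], "age": [30, 25]})
pq.write_table(table, "people.parquet", compression="snappy")  # columnar + compressed

# Columnar layout lets a reader pull just the required columns
names_only = pq.read_table("people.parquet", columns=["name"])
print(names_only)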
5. ORC (Optimized Row Columnar)
● Description: A columnar storage format developed for the Hadoop ecosystem and optimized for Hive workloads.
● Strengths:
o Efficient compression and fast reads for analytical queries.
● Weaknesses:
o Not as widely adopted outside the Hadoop ecosystem.
● Use Cases:
o Data warehousing in Hadoop.
● Example:
o ORC files are binary and are typically created and managed using Hadoop-related tools.
Comparison of Common Serialization Formats

Feature                | JSON                   | XML            | Avro | Parquet | ORC
Human-Readable         | Yes                    | Yes            | No   | No      | No
Schema Support         | Optional (JSON Schema) | Yes (DTD, XSD) | Yes  | Yes     | Yes
Read/Write Performance | Moderate               | Slow           | Fast | Fast    | Fast
3. Apache Thrift
● Schema-based: Uses an IDL (Interface Definition Language) to define data structures.
● Binary format: Compact and efficient.
● Language support: Supports a wide range of programming languages.
● Service definition: Besides serialization, it also provides tools for building RPC services.
● Use cases: Suitable for cross-language services and data serialization.
4. Apache Parquet
● Columnar storage format: Efficient for read-heavy operations on large datasets.
● Schema-based: Stores data along with its schema.
● Optimized for Hadoop: Designed to work well with Hadoop ecosystems, including Spark.
● Compression: Supports various compression methods for efficient storage.
● Use cases: Best for analytical queries where columnar access patterns are common.
6. JSON
● Human-readable: Text-based and easily readable by humans.
● Schema-less: Flexible, but can lead to inconsistencies.
● Interoperability: Widely used for web APIs and data interchange.
● Performance: Not as efficient as binary formats in terms of storage and parsing speed.
● Use cases: Great for configuration files, web APIs, and situations where human readability is important.
7. XML
● Human-readable: Text-based and more verbose than JSON.
● Schema support: Can use DTD or XSD to define structure.
● Interoperability: Widely used for data interchange and configuration.
● Performance: Less efficient in terms of storage and parsing compared to binary formats.
● Use cases: Suitable for document-centric data exchange, configuration files, and industry-specific standards (e.g., SOAP).
8. MessagePack
● Binary format: More efficient than JSON but retains the flexibility of schema-less data.
● Compact: Smaller size compared to JSON.
● Language support: Supports many programming languages.
● Use cases: Useful for scenarios where JSON is used but performance and space efficiency are concerns (see the sketch below).
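A minimal round-trip sketch assuming the third-party msgpack package; the record is illustrative:
python
import msgpack

record = {"name": "John Doe", "age": 30, "courses": ["Math", "Science"]}

packed = msgpack.packb(record)      # binary encoding, smaller than JSON text
restored = msgpack.unpackb(packed)  # decode back to Python objects

assert restored == record
print(len(packed), "bytes packed")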
9. Apache Arrow
● Columnar format: Designed for efficient analytics and processing.
● In-memory: Optimized for in-memory storage and operations.
● Interoperability: Facilitates data interchange between different data processing systems.
● Use cases: Ideal for in-memory data processing tasks and interoperability between big data systems (see the sketch below).
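A minimal sketch with the pyarrow package showing pandas data moving into Arrow's in-memory columnar form and back; the DataFrame is illustrative:
python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"name": ["John Doe", "Jane Roe"], "age": [30, 25]})

table = pa.Table.from_pandas(df)  # pandas -> Arrow columnar memory
print(table.schema)

df_back = table.to_pandas()       # Arrow -> pandas, often without copying
print(df_back)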
Key Considerations:
● Schema evolution: How well the format supports changes in data structure over time.
● Performance: Read/write speed and storage efficiency.
● Interoperability: Compatibility with different systems and languages.
● Ease of use: Complexity of setup and usage.
● Compression: Availability and efficiency of compression methods.