
DATA ANALYTICS

Unit V

YARN

Introduction to YARN

YARN (Yet Another Resource Negotiator) is a core component of Apache Hadoop that enhances the resource management and job scheduling capabilities of Hadoop. It allows multiple data processing engines, such as batch processing, interactive processing, stream processing, and more, to run and process data stored in Hadoop.

YARN is designed to provide a more flexible and efficient resource management framework, enabling better cluster utilization and scalability.

Components of YARN
1. ResourceManager (RM)
   o Responsibilities: Manages and allocates resources across the cluster. It is the central authority that arbitrates resources among all applications in the system.
   o Components:
     ● Scheduler: Allocates resources to various running applications based on defined constraints like capacity, queues, etc. It does not monitor or track the status of applications.
     ● ApplicationsManager: Manages the lifecycle of applications, accepting job submissions, negotiating the first container for executing the application-specific ApplicationMaster, and restarting the ApplicationMaster on failure.
2. NodeManager (NM)
   o Responsibilities: Manages resources on a single node, monitoring resource usage (CPU, memory, disk, network) and reporting it to the ResourceManager. It is also responsible for managing the lifecycle of containers and monitoring their resource usage.
   o Components:
     ● Container: A collection of physical resources (CPU cores, memory) on a single node. It is the basic unit of resource allocation in YARN.
     ● Resource Monitoring: Tracks and reports resource usage on each node to the ResourceManager.
3. ApplicationMaster (AM)
   o Responsibilities: Each application has its own ApplicationMaster that negotiates resources with the ResourceManager and works with the NodeManager(s) to execute and monitor tasks. It handles the application-specific logic of job execution, failure handling, and communication with the ResourceManager.
4. Containers
   o Responsibilities: The fundamental unit of processing capacity in YARN. Containers encapsulate a collection of resources like CPU, memory, and storage, and they are used by applications to execute tasks.

Needs and Challenges of YARN

Needs Addressed by YARN

1. Resource Utilization: YARN enables more efficient use of cluster resources by allowing multiple types of data processing engines to share a common resource pool.
2. Scalability: YARN is designed to scale out to support large clusters with thousands of nodes, handling diverse workloads.
3. Flexibility: Supports different processing models (batch, interactive, real-time) within the same cluster, enhancing the Hadoop ecosystem's versatility.
4. Improved Resource Management: With a dedicated ResourceManager, YARN provides better resource allocation and scheduling compared to the older Hadoop MapReduce architecture.

Challenges of YARN
1. Complexity: The architecture of YARN is more complex than the original Hadoop MapReduce, requiring more sophisticated management and troubleshooting.
2. Resource Contention: Properly configuring and tuning resource allocation policies can be challenging to prevent contention and ensure fair resource distribution.
3. Security: Ensuring secure communication and resource allocation between various components of YARN (ResourceManager, NodeManager, ApplicationMaster) is essential.
4. Fault Tolerance: Handling failures efficiently and ensuring that applications can recover gracefully is a critical aspect of managing a YARN cluster.
5. Monitoring and Debugging: Comprehensive monitoring and debugging tools are necessary to manage large, dynamic, and diverse workloads effectively.

YARN revolutionizes resource management and job scheduling in the Hadoop ecosystem by providing a flexible, scalable, and efficient framework. Its components work together to ensure that resources are utilized effectively across various types of data processing workloads. While YARN addresses many limitations of the earlier Hadoop architecture, it also introduces new challenges related to complexity, resource management, security, fault tolerance, and monitoring, which must be managed to fully leverage its capabilities.

Dissecting YARN in the YARN Framework

YARN (Yet Another Resource Negotiator) serves as the resource management layer of the Hadoop ecosystem, fundamentally enhancing its ability to handle various data processing tasks. Below, we dissect YARN, exploring its architecture, components, and operational workflow.

Architecture of YARN

YARN's architecture decouples resource management from job scheduling and monitoring, allowing it to support a variety of processing frameworks (e.g., MapReduce, Spark, Tez). The key components are:
1. ResourceManager (RM)
2. NodeManager (NM)
3. ApplicationMaster (AM)
4. Containers

Components of YARN

1. ResourceManager (RM)

The ResourceManager is the master daemon of YARN, responsible for resource allocation and management across the cluster. It has two main components:
● Scheduler
   o Allocates resources to various running applications based on constraints like capacity, queues, etc.
   o Uses different policies (FIFO, Capacity Scheduler, Fair Scheduler) to manage resource distribution.
● ApplicationsManager (ASM)
   o Manages the lifecycle of applications, from job submission to completion.
   o Negotiates the first container for executing the ApplicationMaster and restarts it upon failure.

2. NodeManager (NM)

The NodeManager runs on each node in the cluster and is responsible for managing the node's resources. It monitors resource usage (CPU, memory, disk, network) and reports to the ResourceManager. Key responsibilities include:
● Container Management
   o Launches and monitors containers as instructed by the ApplicationMaster.
● Resource Monitoring
   o Tracks resource usage by containers and reports to the ResourceManager.

3. ApplicationMaster (AM)
Each application has its own ApplicationMaster, which is a framework-specific entity responsible for negotiating resources with the ResourceManager and working with the NodeManagers to execute and monitor tasks.
● Resource Negotiation
   o Requests resources from the ResourceManager based on the application's requirements.
● Task Execution
   o Assigns tasks to containers and monitors their execution.
● Fault Tolerance
   o Handles task failures and ensures job completion.

4. Containers

Containers are the fundamental unit of resource allocation in YARN. They encapsulate resources like CPU, memory, and storage required for executing a task.
● Resource Allocation
   o Containers are allocated by the ResourceManager and managed by the NodeManager.
● Task Execution
   o Tasks run within containers, which provide the necessary runtime environment.
Operational Workflow of YARN

1. Application Submission
   o The client submits an application to the ResourceManager, specifying the ApplicationMaster.
2. ApplicationMaster Initialization
   o The ResourceManager allocates a container for the ApplicationMaster and launches it on an available NodeManager.
3. Resource Negotiation
   o The ApplicationMaster negotiates with the ResourceManager to request containers for executing tasks.
4. Task Execution
   o Containers are allocated by the ResourceManager and launched by the NodeManager.
   o The ApplicationMaster assigns tasks to these containers and monitors their execution.
5. Progress and Status Reporting
   o The ApplicationMaster periodically updates the ResourceManager with the application's progress and status.
6. Completion and Cleanup
   o Upon task completion, the ApplicationMaster notifies the ResourceManager, releases resources, and terminates.
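
The progress and status that the ResourceManager aggregates in steps 5 and 6 above can also be inspected programmatically. As a rough, hedged sketch only, the Python snippet below polls the ResourceManager's REST interface (commonly exposed at port 8088 under /ws/v1/cluster/apps); the host name, port, and state filter are illustrative assumptions, not something prescribed by these notes.

# Minimal sketch (not an official client): list running YARN applications
# by querying the ResourceManager REST API and printing their progress.
import json
import urllib.request

RM_URL = "http://resourcemanager.example.com:8088"  # hypothetical host/port

def list_running_apps(rm_url=RM_URL):
    # /ws/v1/cluster/apps returns a JSON document describing applications;
    # the "states" query parameter filters by application state.
    with urllib.request.urlopen(f"{rm_url}/ws/v1/cluster/apps?states=RUNNING") as resp:
        payload = json.load(resp)
    apps = (payload.get("apps") or {}).get("app", [])
    for app in apps:
        print(app["id"], app["name"], f'{app["progress"]:.0f}%', app["state"])

if __name__ == "__main__":
    list_running_apps()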

Needs Addressed by YARN

1. Efficient Resource Utilization
   o YARN enables better cluster utilization by dynamically allocating resources based on application needs.
2. Scalability
   o Designed to handle large-scale clusters, YARN scales horizontally, supporting thousands of nodes and diverse workloads.
3. Flexibility
   o Supports various data processing models (batch, interactive, real-time) within the same cluster, enhancing the Hadoop ecosystem's versatility.
4. Improved Resource Management
   o The separation of resource management from job execution allows for more sophisticated and efficient resource scheduling.

Challenges in YARN
1. Complexity
   o The architecture is more complex compared to the older Hadoop MapReduce, requiring advanced management and troubleshooting skills.
2. Resource Contention
   o Proper configuration and tuning are essential to prevent resource contention and ensure fair distribution.
3. Security
   o Ensuring secure communication and resource allocation between the ResourceManager, NodeManager, and ApplicationMaster components is critical.
4. Fault Tolerance
   o Efficient handling of failures and ensuring that applications can recover gracefully is a significant challenge.
5. Monitoring and Debugging
   o Comprehensive tools are necessary to monitor and debug large, dynamic workloads effectively.

YARN is a robust resource management framework that significantly enhances Hadoop's capabilities, making it more flexible, scalable, and efficient. By decoupling resource management from job scheduling, YARN supports various processing models and ensures better resource utilization across the cluster. However, its complexity, resource contention, security, fault tolerance, and monitoring challenges require careful management to fully leverage its benefits.

MapReduce Applications

MapReduce Applications in Hadoop

MapReduce is a programming model and processing technique associated with the Hadoop ecosystem, designed to process large volumes of data in a distributed and parallel manner. Below are key applications and use cases of Hadoop MapReduce:

Applications of Hadoop MapReduce

1. Data Analysis and Transformation
   o Log Analysis: Processing large-scale log files to extract useful information like error patterns, usage statistics, etc.
   o Data Cleaning and Transformation: Converting raw data into structured formats, handling missing values, and performing data enrichment.
2. Search Indexing
   o Building Search Indexes: Creating indexes for search engines, enabling fast search and retrieval of large datasets.
   o Inverted Indexing: Generating an index where each term points to its occurrences in the dataset, which is fundamental for search engines.
3. Recommendation Systems
   o Collaborative Filtering: Generating product or content recommendations based on user behavior, such as purchase history or viewing patterns.
   o Content-Based Filtering: Recommending items based on item attributes and user preferences.
4. Data Warehousing
   o ETL (Extract, Transform, Load): Extracting data from various sources, transforming it into a suitable format, and loading it into data warehouses.
   o OLAP (Online Analytical Processing): Performing complex queries and analysis on large datasets to support business decision-making.
5. Machine Learning
   o Training Models: Implementing algorithms like k-means clustering, linear regression, and classification on large datasets.
   o Data Preprocessing: Cleaning and preparing data for machine learning models, including tasks like normalization, feature extraction, and sampling.

6. Text Processing
   o Sentiment Analysis: Analyzing large volumes of text data to determine sentiment, often used in social media analysis.
   o Word Count: A classic MapReduce example where the frequency of words in a large corpus of text is counted (a minimal sketch appears after this list).
7. Genomics and Bioinformatics
   o Sequence Alignment: Aligning DNA sequences to identify similarities and differences, essential in genetic research.
   o Genomic Data Processing: Analyzing large-scale genomic datasets for insights into genetic variations and disease patterns.
8. Financial Services
   o Risk Management: Analyzing large datasets to identify and mitigate financial risks.
   o Fraud Detection: Detecting fraudulent activities by analyzing transaction patterns and behaviors.
9. Social Network Analysis
   o Graph Processing: Analyzing relationships and interactions in social networks to identify influential users, communities, and trends.
   o Friend Recommendations: Suggesting new connections to users based on their existing social graph.
10. Web Data Processing
   o Web Crawling and Indexing: Crawling the web to collect data and creating indexes for efficient search and retrieval.
   o Clickstream Analysis: Analyzing user navigation patterns on websites to understand behavior and improve user experience.
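
To make the Word Count use case in item 6 concrete, here is a minimal sketch of a mapper and reducer written in the Hadoop Streaming style (scripts that read stdin and write stdout); the file name and invocation are illustrative assumptions, not something prescribed by these notes, and the same logic applies to a Java MapReduce job.

# Minimal word-count sketch in the Hadoop Streaming style.
# mapper: emits "word<TAB>1" for every word on stdin.
# reducer: sums counts for consecutive identical keys (Hadoop sorts records
# by key between the map and reduce phases).
import sys

def mapper(lines):
    for line in lines:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer(lines):
    current, total = None, 0
    for line in lines:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    # e.g. run as: wordcount.py map    or    wordcount.py reduce
    (mapper if sys.argv[1] == "map" else reducer)(sys.stdin)

In an actual Hadoop Streaming run, such a script would typically be passed as the -mapper and -reducer arguments of the streaming jar, with input and output paths on HDFS; that command line is an assumption about deployment rather than part of these notes.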

Key Advantages of Hadoop MapReduce

1. Scalability
   o Can handle petabytes of data by distributing the processing across a large number of nodes in a cluster.
2. Fault Tolerance
   o Automatically handles failures by re-executing failed tasks on different nodes, ensuring the reliability of the processing pipeline.
3. Cost Efficiency
   o Utilizes commodity hardware, reducing the cost of data processing infrastructure.
4. Flexibility
   o Supports a wide range of applications and use cases across different domains.

Challenges of Hadoop MapReduce

1. Complexity in Programming
   o Requires writing custom code in Java or other supported languages, which can be complex and time-consuming for non-trivial tasks.
2. Latency
   o Not suitable for real-time data processing due to its batch processing nature, leading to higher latency compared to stream processing frameworks.
3. Resource Management
   o Efficient resource allocation and management are crucial to prevent resource contention and ensure optimal performance.
4. Debugging and Monitoring
   o Requires robust tools for debugging, monitoring, and tuning performance, especially in large-scale deployments.

Hadoop MapReduce is a powerful framework for processing large-scale data across various applications, from log analysis and data warehousing to machine learning and genomic research. Despite its complexities and limitations, its ability to scale and handle massive datasets makes it an essential tool in the big data ecosystem. Effective use of MapReduce requires understanding its architecture, strengths, and challenges, enabling organizations to leverage its capabilities for efficient and reliable data processing.

Data Serialization

Data serialization is a crucial aspect of data analysis as it involves converting data into a format that can be easily stored, transferred, and reconstructed. Here is a detailed look at data serialization in the context of data analysis:

Importance of Data Serialization in Data Analysis

1. Efficiency: Serialized data formats often reduce the size of data, making it faster to read from and write to disk, as well as to transfer over networks.
2. Compatibility: Enables the sharing of data between different systems and applications, even if they are written in different programming languages.
3. Persistence: Serialized data can be stored on disk and later read back into memory, allowing for long-term storage of analysis results.
4. Performance: Efficient serialization formats can significantly speed up the data loading and saving processes, which is critical when working with large datasets.

Common Serialization Formats in Data Analysis

1. CSV (Comma-Separated Values)
   o Advantages: Simple, human-readable, and widely supported.
   o Disadvantages: Inefficient for large datasets, lacks support for complex data types and schema.
   o Use cases: Quick data exchange, initial data exploration, and data sharing with non-technical stakeholders.
2. JSON (JavaScript Object Notation)
   o Advantages: Human-readable, supports nested data structures.
   o Disadvantages: Larger file size compared to binary formats, slower read/write performance.
   o Use cases: Configuration files, web APIs, data interchange between services.
3. Parquet
   o Advantages: Columnar storage format, efficient for analytical queries, supports compression.
   o Disadvantages: Not human-readable, schema evolution can be complex.
   o Use cases: Big data analytics, data warehousing, and any application requiring efficient read-heavy operations.
4. Avro
   o Advantages: Schema-based, compact binary format, supports schema evolution.
   o Disadvantages: Requires schema management.
   o Use cases: Data serialization for big data pipelines, data exchange between systems.
5. Feather
   o Advantages: Fast read/write, designed for use with Python and R.
   o Disadvantages: Limited support for complex data types compared to Parquet or Avro.
   o Use cases: Quick data interchange between Python and R, in-memory data analysis.

6. HDF5 (Hierarchical Data Format)
   o Advantages: Suitable for large datasets, supports complex data structures, efficient I/O operations.
   o Disadvantages: Not as widely supported for interoperability as other formats.
   o Use cases: Scientific computing, large-scale data storage, multidimensional data analysis (a short h5py sketch follows this list).
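
As a rough illustration of the HDF5 use case in item 6, the sketch below writes and reads a small numeric array with the h5py library; the file name and dataset name are made up for the example.

# Minimal HDF5 sketch using h5py (names are illustrative, not prescribed).
import h5py
import numpy as np

data = np.random.rand(1000, 3)  # a small multidimensional array

# Write: datasets live inside a hierarchical file, optionally compressed.
with h5py.File("experiment.h5", "w") as f:
    f.create_dataset("measurements", data=data, compression="gzip")

# Read back only what is needed; HDF5 supports partial/sliced reads.
with h5py.File("experiment.h5", "r") as f:
    first_rows = f["measurements"][:10]
print(first_rows.shape)  # (10, 3)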


Key Considerations in Choosing a Serialization Format
1. Data Size: Large datasets benefit more from binary formats (e.g., Parquet, Avro) due to better compression and I/O performance.
2. Data Complexity: Formats like JSON and Avro are better suited for complex nested data structures.
3. Interoperability: CSV and JSON are widely supported across different tools and languages, making them suitable for data interchange.
4. Performance: For high-performance needs, formats like Parquet and Feather provide efficient read/write capabilities.
5. Schema Evolution: If the schema is expected to change over time, formats like Avro that support schema evolution are advantageous.
6. Ease of Use: Human-readable formats like CSV and JSON are easier to use and debug, but may not be suitable for all performance needs.

Serialization and Data Analysis Workflow

1. Data Collection: Data is collected and often serialized for storage or transmission.
2. Data Preprocessing: Serialized data is deserialized into a suitable format for analysis (e.g., a DataFrame in Python).
3. Data Analysis: Analytical operations are performed on the deserialized data.
4. Result Storage: Analysis results are often serialized for storage or further processing.
5. Data Sharing: Serialized data is shared between different systems or team members.
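
A compressed version of this workflow, sketched with pandas purely for illustration (the file names and column names are assumptions):

# Deserialize -> analyze -> re-serialize, sketched with pandas.
import pandas as pd

# 1-2. Collection/preprocessing: deserialize raw CSV into a DataFrame.
raw = pd.read_csv("sales_raw.csv")            # hypothetical input file

# 3. Analysis: a simple aggregation on the deserialized data.
summary = raw.groupby("region", as_index=False)["amount"].sum()

# 4. Result storage: serialize results to a compact columnar format.
summary.to_parquet("sales_summary.parquet")   # requires pyarrow or fastparquet

# 5. Sharing: a human-readable copy for non-technical stakeholders.
summary.to_csv("sales_summary.csv", index=False)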

Tools and Libraries

● Pandas: Provides support for reading/writing CSV, JSON, Parquet, and other formats.
● PyArrow: Offers efficient read/write for Parquet and Feather formats.
● h5py: Allows working with the HDF5 format in Python.
● fastavro: Library for working with Avro data in Python.

By carefully choosing the appropriate serialization format based on the specific requirements of your data analysis tasks, you can optimize the performance, efficiency, and interoperability of your data processing workflows.

Working with Common Serialization Formats

Serialization formats are essential in Big Data for efficient storage, transmission, and processing of data. Common serialization formats include JSON, XML, Avro, Parquet, and ORC. Each format has its own strengths and is suitable for specific use cases. Here are detailed notes on these common serialization formats:

1. JSON (JavaScript Object Notation)

● Description: A lightweight, text-based, language-independent data interchange format.
● Strengths:
   o Human-readable and easy to write.
   o Widely used in web applications and APIs.
   o Supports hierarchical data structures (objects and arrays).
● Weaknesses:
   o Not as efficient in terms of storage size and read/write performance compared to binary formats.
   o No built-in schema support for data validation.
● Use Cases:
   o Web APIs.
   o Configuration files.
   o Data interchange between systems.
● Example:

{
  "name": "John Doe",
  "age": 30,
  "isStudent": false,
  "courses": ["Math", "Science"]
}
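
To show how such a document is produced and consumed in practice, here is a small round-trip sketch with Python's standard json module; the dictionary contents simply mirror the example above.

# Serialize a Python dict to JSON text and parse it back.
import json

person = {"name": "John Doe", "age": 30, "isStudent": False,
          "courses": ["Math", "Science"]}

text = json.dumps(person, indent=2)   # serialization (dict -> JSON string)
restored = json.loads(text)           # deserialization (JSON string -> dict)
assert restored == person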

2. XML (eXtensible Markup Language)

● Description: A markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.
● Strengths:
   o Highly flexible and extensible.
   o Supports complex hierarchical structures and mixed content.
   o Robust schema validation (DTD, XSD).
● Weaknesses:
   o Verbose and can lead to large file sizes.
   o Parsing and processing can be slower compared to other formats.
● Use Cases:
   o Document storage and exchange (e.g., technical documentation, office documents).
   o Systems requiring strict data validation.
● Example:

<person>
  <name>John Doe</name>
  <age>30</age>
  <isStudent>false</isStudent>
  <courses>
    <course>Math</course>
    <course>Science</course>
  </courses>
</person>
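
A minimal sketch of parsing that document with Python's standard xml.etree.ElementTree module, shown only to illustrate machine readability (the variable names are arbitrary):

# Parse the <person> document above and pull out a few fields.
import xml.etree.ElementTree as ET

xml_text = """<person><name>John Doe</name><age>30</age>
<isStudent>false</isStudent>
<courses><course>Math</course><course>Science</course></courses></person>"""

root = ET.fromstring(xml_text)
name = root.findtext("name")                                 # "John Doe"
age = int(root.findtext("age"))                              # 30
courses = [c.text for c in root.findall("courses/course")]   # ["Math", "Science"]
print(name, age, courses)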

3. Avro
● Description: A row-oriented remote procedure call and data serialization framework developed within the Apache Hadoop project.
● Strengths:
   o Compact binary format leading to efficient storage and processing.
   o Supports schema evolution, making it suitable for long-term data storage.
   o Good integration with the Hadoop ecosystem.
● Weaknesses:
   o Less human-readable due to its binary nature.
   o Schema definition is required.
● Use Cases:
   o Data serialization for Hadoop and other Big Data frameworks.
   o Efficient storage and transmission of large datasets.
● Example Schema:

{
  "type": "record",
  "name": "Person",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"},
    {"name": "isStudent", "type": "boolean"},
    {"name": "courses", "type": {"type": "array", "items": "string"}}
  ]
}
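
A rough sketch of putting that schema to work, here with the fastavro library mentioned earlier (the file name and records are illustrative):

# Write and read Avro records against the Person schema above using fastavro.
from fastavro import writer, reader, parse_schema

schema = parse_schema({
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
        {"name": "isStudent", "type": "boolean"},
        {"name": "courses", "type": {"type": "array", "items": "string"}},
    ],
})

records = [{"name": "John Doe", "age": 30, "isStudent": False,
            "courses": ["Math", "Science"]}]

with open("people.avro", "wb") as out:
    writer(out, schema, records)        # binary file with the schema embedded

with open("people.avro", "rb") as src:
    for rec in reader(src):
        print(rec["name"], rec["age"])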

4. Parquet
● Description: A columnar storage file format optimized for use with Big Data processing frameworks.
● Strengths:
   o Efficient in terms of storage space and read/write performance for large datasets.
   o Optimized for analytical queries, particularly those involving columnar operations.
   o Supports complex nested data structures.
● Weaknesses:
   o Less suitable for row-based operations.
   o Binary format is not human-readable.
● Use Cases:
   o Data warehousing and analytics.
   o Storage and processing in Hadoop and Spark ecosystems.
● Example:
   o Parquet files are binary and not typically shown as text, but they can be created and read using tools like Apache Spark.
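
Outside of Spark, a quick way to produce and inspect a Parquet file is through pandas with a pyarrow backend; a minimal sketch, where the file and column names are assumptions:

# Create a Parquet file and read back only selected columns.
import pandas as pd

df = pd.DataFrame({"user_id": [1, 2, 3],
                   "country": ["IN", "US", "DE"],
                   "amount": [120.5, 89.0, 42.3]})

df.to_parquet("events.parquet")  # uses pyarrow (or fastparquet) under the hood

# Columnar layout lets readers pull just the columns a query needs.
subset = pd.read_parquet("events.parquet", columns=["country", "amount"])
print(subset.head())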

5. ORC (Optimized Row Columnar)

● Description: A columnar storage format for Hadoop that uses compression, indexing, and schema evolution to optimize storage and query performance.
● Strengths:
   o High compression ratios, leading to reduced storage costs.
   o Fast query performance due to optimized reading of column data.
   o Good support for schema evolution.
● Weaknesses:
   o Not as widely adopted outside the Hadoop ecosystem.
   o Like Parquet, it is a binary format and not human-readable.
● Use Cases:
   o Data warehousing in Hadoop.
   o Analytical processing with Hive and other Big Data tools.
● Example:
   o ORC files are binary and are typically created and managed using Hadoop-related tools.
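
Outside Hive, ORC files can also be written and read from Python through pyarrow's orc module (available in recent pyarrow releases); a rough sketch under that assumption, with illustrative file and column names:

# Write and read an ORC file with pyarrow (recent versions ship pyarrow.orc).
import pyarrow as pa
import pyarrow.orc as orc

table = pa.table({"user_id": [1, 2, 3],
                  "country": ["IN", "US", "DE"],
                  "amount": [120.5, 89.0, 42.3]})

orc.write_table(table, "events.orc")          # columnar, compressed on disk

read_back = orc.read_table("events.orc", columns=["country", "amount"])
print(read_back.to_pandas())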

Comparison of Big Data Serialization Formats

Format    Human-Readable   Schema Support           Compression   Read/Write Performance   Ideal Use Case
JSON      Yes              Optional (JSON Schema)   No            Moderate                 Web APIs, config files
XML       Yes              Yes (DTD, XSD)           No            Slow                     Document exchange, data interchange
Avro      No               Yes                      Yes           Fast                     Hadoop, data serialization
Parquet   No               Yes                      Yes           Fast                     Data warehousing, analytics
ORC       No               Yes                      Yes           Fast                     Data warehousing, analytics
Conclusion

Selecting the appropriate serialization format depends on the specific requirements of the use case, including factors such as readability, storage efficiency, schema evolution, and integration with existing Big Data tools. JSON and XML are suitable for scenarios requiring human-readable formats, while Avro, Parquet, and ORC are optimized for storage and processing in Big Data environments.

Big Data Serialization Formats

1. Apache Avro
● Schema-based: Data is serialized according to a schema, which is stored along with the data.
● Binary format: Efficient storage and quick read/write performance.
● Schema evolution: Supports adding fields and other modifications without breaking compatibility.
● Integration: Well integrated with Hadoop, Spark, and other big data tools.
● Use cases: Ideal for data exchange between systems, especially in big data pipelines.

2. Protocol Buffers (Protobuf)

● Schema-based: Requires defining a schema (.proto file) to structure the data.
● Binary format: Compact and efficient for both storage and transmission.
● Language support: Supports multiple programming languages (Java, C++, Python, etc.).
● Schema evolution: Handles changes like adding or removing fields gracefully.
● Use cases: Good for RPC (Remote Procedure Call) protocols and data storage.

3. Apache Thrift
● Schema-based: Uses an IDL (Interface Definition Language) to define data structures.
● Binary format: Compact and efficient.
● Language support: Supports a wide range of programming languages.
● Service definition: Besides serialization, it also provides tools for building RPC services.
● Use cases: Suitable for cross-language services and data serialization.

4. Apache Parquet
● Columnar storage format: Efficient for read-heavy operations on large datasets.
● Schema-based: Stores data along with its schema.
● Optimized for Hadoop: Designed to work well with Hadoop ecosystems, including Spark.
● Compression: Supports various compression methods for efficient storage.
● Use cases: Best for analytical queries where columnar access patterns are common.

5. ORC (Optimized Row Columnar)

● Columnar storage format: Optimized for read-heavy operations, similar to Parquet.
● Schema-based: Embeds the schema with the data.
● Compression: Provides efficient compression methods.
● Optimized for Hadoop: Works well with the Hadoop ecosystem.
● Use cases: Ideal for large-scale data processing tasks, especially in Hive.

6. JSON
● Human-readable: Text-based and easily readable by humans.
● Schema-less: Flexible, but can lead to inconsistencies.
● Interoperability: Widely used for web APIs and data interchange.
● Performance: Not as efficient as binary formats in terms of storage and parsing speed.
● Use cases: Great for configuration files, web APIs, and situations where human readability is important.

7. XML
● Human-readable: Text-based and more verbose than JSON.
● Schema support: Can use DTD or XSD to define structure.
● Interoperability: Widely used for data interchange and configuration.
● Performance: Less efficient in terms of storage and parsing compared to binary formats.
● Use cases: Suitable for document-centric data exchange, configuration files, and industry-specific standards (e.g., SOAP).

8. MessagePack
● Binary format: More efficient than JSON but retains the flexibility of schema-less data.
● Compact: Smaller size compared to JSON.
● Language support: Supports many programming languages.
● Use cases: Useful for scenarios where JSON is used but performance and space efficiency are concerns.
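
A minimal sketch of the JSON-versus-MessagePack trade-off using the msgpack Python package (an illustrative choice of library; the payload is arbitrary):

# Compare the size of the same payload serialized as JSON and as MessagePack.
import json
import msgpack  # third-party package: pip install msgpack

payload = {"name": "John Doe", "age": 30, "courses": ["Math", "Science"]}

as_json = json.dumps(payload).encode("utf-8")
as_msgpack = msgpack.packb(payload)

print(len(as_json), len(as_msgpack))           # MessagePack is typically smaller
assert msgpack.unpackb(as_msgpack) == payload  # round-trips back to the same dict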

9. Apache Arrow
● Columnar format: Designed for efficient analytics and processing.
● In-memory: Optimized for in-memory storage and operations.
● Interoperability: Facilitates data interchange between different data processing systems.
● Use cases: Ideal for in-memory data processing tasks and interoperability between big data systems.
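
A short illustration of Arrow's in-memory, columnar orientation with pyarrow (the column names are made up; the pandas conversion is just one common interchange path):

# Build an in-memory Arrow table and hand it to pandas without reserializing.
import pyarrow as pa

table = pa.table({"user_id": [1, 2, 3], "score": [0.9, 0.4, 0.7]})

print(table.schema)           # Arrow keeps an explicit schema with the data
print(table.column("score"))  # columnar access to a single field

df = table.to_pandas()        # interchange with other in-memory tools
print(df.mean(numeric_only=True))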

Key Considerations:
● Schema evolution: How well the format supports changes in data structure over time.
● Performance: Read/write speed and storage efficiency.
● Interoperability: Compatibility with different systems and languages.
● Ease of use: Complexity of setup and usage.
● Compression: Availability and efficiency of compression methods.

These serialization formats are chosen based on the specific use cases and requirements of the data processing pipeline or system.
