Learn 2
For a healthcare use case scenario, how do the volume, velocity, and variety of data necessitate the
adoption of big data analytics techniques?
The healthcare sector generates vast and complex data, making traditional data management methods inadequate.
The adoption of big data analytics is critical to process, store, and derive actionable insights from this data efficiently,
especially due to its volume, velocity, and variety.
1. Volume
• Healthcare data includes electronic health records (EHRs), medical imaging, genomic data, clinical trials, and
insurance claims.
• Massive data sources such as IoT-enabled devices (wearables, sensors) and public health data contribute
significantly to the volume.
• Big data platforms like Hadoop and cloud storage systems help handle and store this massive amount of
structured and unstructured data.
2. Velocity
• Real-time data from patient monitoring systems, wearables, and emergency response systems requires
immediate processing to ensure timely interventions.
• For example, heart rate or blood pressure data from a wearable device must be analyzed instantly to alert
healthcare providers in case of abnormalities.
• Big data tools like Apache Kafka and Spark Streaming enable real-time data processing and decision-making.
3. Variety
• Healthcare data is highly diverse, encompassing structured data (EHRs, laboratory results), semi-structured
data (sensor logs, XML), and unstructured data (physician notes, radiology images, video recordings).
• Advanced analytics techniques, including natural language processing (NLP) and image recognition, are
employed to extract meaningful insights from varied data formats.
Design a MapReduce algorithm to calculate the total number of occurrences of various words in a
dataset using the map, shuffle, combiner, and reduce phases. Assume that the dataset consists of a single
large file, and your task is to efficiently count the occurrences of each word across the document.
1. Map Phase
• Input: Splits of the large file (key: line number, value: text of the line).
• Process:
o Split each line into individual words using a delimiter (e.g., spaces, punctuation).
o Emit intermediate key-value pairs in the form of (word, 1) for each word in the line.
• Output: List of key-value pairs (word, 1).
2. Combiner Phase (Optional)
• Input: Intermediate (word, 1) pairs produced by each mapper.
• Process:
o Locally sum the counts for each word on the mapper node before the data is sent across the network.
• Output: Partially aggregated pairs (word, partial_count), which reduces network traffic during the shuffle.
3. Shuffle and Sort Phase
• Input: (word, count) pairs from all mappers (or combiners).
• Process:
o Group all values belonging to the same key, sort the keys, and route each key to a single reducer.
• Output: Key-value pairs of the form (word, [count1, count2, ...]).
4. Reduce Phase
• Input: Key-value pairs (word, [1, 1, ..., 1]) or (word, [partial counts]).
• Process:
o Sum the values for each key to calculate the total count.
o Emit final key-value pairs in the form of (word, total_count).
• Output: Final word count results, e.g., (word, total_count) for all words.
Example Walkthrough
Dataset:
Input Text: "cat dog cat mouse dog"
Map Phase Output:
(cat, 1), (dog, 1), (cat, 1), (mouse, 1), (dog, 1)
Combiner Phase Output (Optional):
(cat, 2), (dog, 2), (mouse, 1)
Shuffle and Sort Phase Output (grouping the combiner output):
(cat, [2]), (dog, [2]), (mouse, [1])
Reduce Phase Output:
(cat, 2), (dog, 2), (mouse, 1)
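To make the walkthrough concrete, here is a minimal local sketch in Scala that mimics the phases with plain collections (in a real MapReduce job the phases run distributed across mapper and reducer nodes; the object and variable names are illustrative):
object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val input = "cat dog cat mouse dog"
    // Map phase: emit (word, 1) for every word in the input
    val mapped = input.split("\\s+").map(word => (word, 1))
    // Combiner + shuffle/sort: group the pairs by key (here done locally)
    val shuffled = mapped.groupBy(_._1)
    // Reduce phase: sum the counts for each word
    val counts = shuffled.map { case (word, pairs) => (word, pairs.map(_._2).sum) }
    counts.foreach(println)  // (cat,2), (dog,2), (mouse,1)
  }
}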
Describe the role of visualization libraries and tools in creating insightful visualizations for big data
analysis.
Visualization libraries and tools play a crucial role in making sense of large and complex datasets by transforming raw
data into accessible, interpretable visual formats. They help to uncover patterns, trends, and outliers that would be
difficult to detect otherwise.
Key Roles:
1. Simplifying Complexity: They convert massive datasets into intuitive visual representations like graphs,
charts, and heatmaps, making it easier to identify patterns and trends.
2. Real-Time Insights: Many visualization tools support real-time data streaming, allowing for dynamic updates
and immediate insights from big data.
3. Interactive Exploration: Tools like Tableau, D3.js, and Power BI allow users to interact with data, drill down
into specific points, and explore relationships between variables.
4. Enhanced Decision-Making: Visualizations help decision-makers quickly grasp key insights, leading to more
informed decisions in business, healthcare, and other sectors.
5. Storytelling with Data: They enable data storytelling by visually representing insights in a clear, compelling
way to communicate findings to stakeholders.
Visualization tools, such as Tableau, D3.js, and Matplotlib, help bridge the gap between raw data and actionable
insights, improving the efficiency of big data analysis and fostering data-driven decision-making.
RDD (Resilient Distributed Dataset): RDD is the fundamental data structure in Apache Spark. It is an immutable,
distributed collection of objects that can be processed in parallel across a cluster. RDDs are fault-tolerant, meaning
that if a partition of an RDD is lost, it can be recomputed from the original data.
Role in Spark:
• RDDs allow distributed data processing by enabling parallel operations like map(), filter(), and reduce().
• They offer fault tolerance and scalability, making them ideal for handling large-scale data.
• RDDs are the building blocks for Spark transformations and actions, serving as the primary abstraction for
handling data.
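A brief hedged sketch of these operations, assuming an existing SparkContext named sc (for example, the one provided by spark-shell):
val numbers = sc.parallelize(1 to 10)   // distributed collection (RDD)
val doubled = numbers.map(_ * 2)        // transformation
val bigOnes = doubled.filter(_ > 10)    // transformation
val total = bigOnes.reduce(_ + _)       // action: triggers the distributed computation
println(total)                          // 12 + 14 + 16 + 18 + 20 = 80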
How does Spark Streaming facilitate the processing of continuous streams of data, and what are its
advantages over batch processing frameworks like MapReduce?
Spark Streaming: Spark Streaming processes real-time data by breaking it into small batches of data (micro-batches),
which are then processed using the same Spark engine used for batch processing. This allows continuous data
streams (like sensor data, logs, etc.) to be processed in near real-time.
Advantages over Batch Processing (MapReduce):
• Real-Time Processing: Unlike MapReduce, which processes data in large, discrete batches, Spark Streaming
processes data as it arrives, enabling real-time analysis.
• Lower Latency: Spark Streaming reduces the time between data arrival and its processing, offering lower
latency compared to batch processing.
• Ease of Use: Spark Streaming leverages the same APIs used in batch processing (like RDDs), making it easier
for developers to write continuous data processing applications.
• Fault Tolerance: Spark Streaming provides the same fault tolerance as Spark through data replication and
checkpointing mechanisms, ensuring reliable data processing.
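A minimal hedged sketch of a Spark Streaming (DStream) word count over a socket source; the host, port, and 5-second batch interval are illustrative assumptions:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))      // 5-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)   // continuous stream of text lines
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()                                        // emitted once per micro-batch
ssc.start()
ssc.awaitTermination()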
Data storage and access in a distributed file system (HDFS):
• Data Ingestion: Data is initially uploaded to the distributed file system (e.g., HDFS). It is split into smaller blocks
and stored across multiple nodes in the cluster.
• Data Distribution: Each block is replicated (typically 3 copies) across different nodes to ensure fault tolerance.
• Data Access: When a user or application requests data, the system retrieves the required block from the node
where it resides, ensuring high availability.
• Fault Tolerance: If a node fails, the system uses the replicated blocks on other nodes to ensure data availability,
without data loss.
• Data Processing: Distributed processing frameworks like Hadoop or Spark access and process these blocks in
parallel, improving speed and scalability.
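As a hedged illustration of how a processing framework reads these distributed blocks in parallel, the Spark sketch below assumes an existing SparkContext sc; the HDFS paths are placeholders:
val logs = sc.textFile("hdfs://namenode:9000/data/events.log")   // roughly one partition per HDFS block
println(logs.getNumPartitions)                                   // reflects how the blocks were split
logs.saveAsTextFile("hdfs://namenode:9000/output/events_copy")   // written back as replicated blocks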
Common RDD transformations with examples:
1. map(): Applies a function to each element in the RDD and returns a new RDD.
o Example: rdd.map(x => x * 2) will multiply each element by 2.
2. filter(): Returns a new RDD containing only the elements that satisfy a predicate function.
o Example: rdd.filter(x => x > 5) keeps only the elements greater than 5.
3. flatMap(): Similar to map(), but can return multiple output elements for each input element.
o Example: rdd.flatMap(x => x.split(" ")) splits each string into words.
4. reduceByKey(): Aggregates data by key using a given function.
o Example: rdd.reduceByKey((a, b) => a + b) sums the values of each key.
5. groupByKey(): Groups data by key, returning a pair of key and a collection of associated values.
o Example: rdd.groupByKey() groups data by their keys, but does not perform any aggregation.
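A short hedged sketch chaining these transformations on sample data (assumes an existing SparkContext sc), which also highlights that reduceByKey aggregates while groupByKey only groups:
val lines = sc.parallelize(Seq("cat dog", "cat mouse"))
val words = lines.flatMap(_.split(" "))        // flatMap: one line -> many words
val pairs = words.map((_, 1))                  // map: word -> (word, 1)
val summed = pairs.reduceByKey(_ + _)          // reduceByKey: (cat,2), (dog,1), (mouse,1)
val grouped = pairs.groupByKey()               // groupByKey: (cat,[1,1]), (dog,[1]), (mouse,[1])
val shortWords = words.filter(_.length == 3)   // filter: keeps cat, dog (drops mouse)
summed.collect().foreach(println)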
Spark RDD persistence (storage) levels:
1. MEMORY_ONLY: Stores data only in memory as deserialized objects. Fastest but may not fit large data in
memory.
2. MEMORY_AND_DISK: Stores data in memory, spilling to disk if necessary. Slower than MEMORY_ONLY but more
fault-tolerant.
3. MEMORY_ONLY_SER: Stores data in memory as serialized objects. More space-efficient but slower than
deserialized storage.
4. DISK_ONLY: Stores data only on disk. Suitable for very large datasets but involves slower I/O operations.
5. MEMORY_AND_DISK_SER: Similar to MEMORY_AND_DISK, but stores data as serialized objects in memory and
on disk, offering better space efficiency with additional processing cost.
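A hedged sketch of selecting a storage level explicitly for an RDD (the HDFS path is a placeholder and an existing SparkContext sc is assumed):
import org.apache.spark.storage.StorageLevel

val data = sc.textFile("hdfs://namenode:9000/data/large.txt")
data.persist(StorageLevel.MEMORY_AND_DISK_SER)  // serialized in memory, spilling to disk if needed
data.count()                                    // the first action materializes the persisted data
data.unpersist()                                // release the storage when it is no longer needed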
Compare and contrast Pig Latin with SQL in terms of their syntax, capabilities, and performance for data
processing tasks.
1. Syntax
• Pig Latin:
o Designed for data flow processing, Pig Latin uses a procedural style.
o Statements are sequential and describe the series of steps needed to process data.
o Example:
data = LOAD 'data.txt' AS (name:chararray, age:int);
filtered = FILTER data BY age > 30;
grouped = GROUP filtered BY name;
• SQL:
o SQL is declarative and uses a set-based approach, focusing on what the result should look like, not
how to compute it.
o Example:
SELECT name, AVG(age) FROM data WHERE age > 30 GROUP BY name;
2. Capabilities
• Pig Latin:
o Data Flow Model: Supports complex data transformations, often involving multiple steps.
o Flexibility: Can handle semi-structured data (like logs, JSON, etc.), making it ideal for ETL processes.
o User-Defined Functions (UDFs): Extends functionality through custom Java functions.
o Complex transformations: Easier for tasks like joins, grouping, filtering, and aggregating across large
datasets.
• SQL:
o Relational Query Language: Focuses on structured data stored in relational databases.
o Set-based Operations: Primarily used for operations on tables and requires data to be organized in
structured formats (like relational schemas).
o Standardized: Has standardized syntax, making it universally applicable to relational databases.
o Limited to relational data: Not as flexible with semi-structured or unstructured data.
3. Performance
• Pig Latin:
o Optimized for Large-Scale Data: Pig can optimize queries at runtime using the MapReduce model,
though it’s less efficient for simple queries compared to SQL.
o Latency: Due to its procedural nature, Pig might have higher latency for certain tasks that SQL could
handle more efficiently.
o Parallel Execution: Stronger in handling complex transformations over large datasets in parallel.
• SQL:
o Optimized Query Execution: SQL engines (like Apache Hive on Hadoop or PostgreSQL) have highly
optimized query planners and indexes.
o Faster for simple queries: SQL can be faster for simple data retrieval, aggregation, and joins due to
its declarative nature and query optimizers.
o Limited for large-scale distributed processing: While SQL can scale with databases, it often doesn't
handle large, unstructured datasets as efficiently as Pig or other Hadoop-based tools.
Apply the transformation function in Scala to count the occurrences of each word in a text file. Given that
file path is stored in variable path_textFile. Design appropriate code for the below statements.
val textFile = sc.textFile(path_textFile)  // assumes an existing SparkContext `sc` (e.g., in spark-shell)
val counts = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
Design a Kafka streaming architecture for Uber's ride-booking platform to support real-time data
processing and analytics. Discuss the key components, data flow, and use cases within the Kafka
ecosystem, considering the challenges and requirements specific to Uber's operations. (Not in syllabus.)
Key Components:
1. Producers:
o Uber's mobile apps and backend services (ride requests, user activities, GPS data, etc.) act as Kafka
producers.
o These components send real-time data (e.g., user ride requests, location updates, driver status) to
Kafka topics.
2. Kafka Brokers:
o Kafka brokers handle the ingestion and distribution of incoming data across multiple partitions for
scalability and fault tolerance.
o Uber can have multiple Kafka brokers to handle high traffic loads and ensure high availability.
3. Kafka Topics:
o Topics represent different data streams (e.g., "ride_requests", "driver_status", "trip_updates").
o Each topic can have multiple partitions to ensure parallel processing and fault tolerance.
4. Kafka Consumers:
o Uber’s analytics platforms, real-time monitoring systems, and microservices act as consumers,
subscribing to relevant topics (e.g., real-time analytics for surge pricing, driver availability, and ride
matching).
o Consumers process data for specific use cases such as trip status tracking, dynamic pricing, or
customer notifications.
5. Kafka Streams:
o For real-time processing, Kafka Streams API can be used to analyze streaming data, such as
computing real-time ride availability, monitoring traffic conditions, or generating reports on driver
performance.
6. Sink Systems:
o Processed data is pushed to databases, data lakes, or external storage systems for further analytics
and reporting (e.g., HDFS, relational databases, or NoSQL stores).
Data Flow:
1. Ride Requests:
o When a user requests a ride, the mobile app (producer) sends a message to the "ride_requests"
Kafka topic.
o Kafka brokers store and distribute the messages to consumers.
2. Driver Updates:
o Drivers update their status (available, busy, offline) via mobile app, and the data is sent to the
"driver_status" Kafka topic.
3. Real-Time Processing:
o Kafka consumers, such as real-time analytics engines or microservices, process ride requests and
driver status updates. For example, Kafka Streams can calculate the nearest available drivers,
dynamically adjust pricing based on traffic data, and match riders with drivers.
4. Analytics and Monitoring:
o Data flows to analytics platforms, where Uber's business intelligence teams can run batch or real-
time analytics on ride patterns, driver efficiency, user behavior, and system performance.
Use Cases:
1. Real-Time Ride Matching:
o Kafka helps Uber match riders with available drivers in real-time by processing incoming ride
requests and driver statuses.
2. Dynamic Pricing (Surge Pricing):
o Kafka processes real-time traffic data and demand-supply balance to implement surge pricing
algorithms.
3. Driver and Rider Notifications:
o Kafka enables immediate notifications to riders (ride confirmations, driver arrival) and drivers (new
ride requests, updates).
4. Fraud Detection:
o Real-time processing of patterns from ride requests and payment data can help identify fraudulent
activities.
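As a hedged sketch of the producer side, the Scala snippet below publishes a ride-request event to the "ride_requests" topic using the standard Kafka client; the broker address and JSON payload shape are illustrative assumptions, not Uber's actual schema:
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object RideRequestProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")  // placeholder broker address
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    val payload = """{"riderId":"r-42","lat":12.97,"lon":77.59,"ts":1700000000}"""  // hypothetical event
    // Keying by rider ID keeps events for the same rider in the same partition (per-rider ordering)
    producer.send(new ProducerRecord[String, String]("ride_requests", "r-42", payload))
    producer.close()
  }
}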
Design a Pig script to process a large log file containing user activity data. The script should extract
relevant information such as user IDs, timestamps, and actions performed, and perform aggregation
tasks like counting the number of actions per user; at the end, also store the aggregated results.
To process a large log file containing user activity data using Apache Pig, you need to design a Pig script that performs
the following tasks:
1. Extract relevant data: Parse the log file to extract user IDs, timestamps, and actions performed.
2. Aggregate data: Count the number of actions per user.
3. Store results: Save the aggregated results to a desired output location (e.g., HDFS).
Here is a step-by-step Pig script that achieves this:
-- Load the log file (assuming a comma-delimited format with user_id, timestamp, and action columns)
logs = LOAD 'hdfs://path_to_log_file/user_activity.log' USING PigStorage(',') AS
(user_id:chararray, timestamp:chararray, action:chararray);
-- Filter out any empty or malformed records (optional step based on your data)
filtered_logs = FILTER logs BY user_id IS NOT NULL AND action IS NOT NULL;
-- Group by user and count the number of actions performed by each user
grouped_logs = GROUP filtered_logs BY user_id;
action_counts = FOREACH grouped_logs GENERATE group AS user_id, COUNT(filtered_logs) AS action_count;
-- Store the aggregated results (output path is a placeholder)
STORE action_counts INTO 'hdfs://path_to_output/user_action_counts' USING PigStorage(',');
3. Feedback Collection
{
"_id": "feedback_id", // Unique identifier for the feedback
"product_id": "string", // Reference to the associated product
"customer_id": "string", // ID of the customer who gave the feedback
"rating": "number",
"comment": "string"
}
MongoDB Commands
i. Find products with price less than 5000:
db.product.find({ price: { $lt: 5000 } })
ii. Drop the material collection:
db.material.drop()
Discuss the ML pipeline in text processing for large-scale datasets with a suitable example.
An ML pipeline for text processing involves a sequence of stages to preprocess text data, extract features, train
models, and evaluate results. It ensures scalability and efficiency for large datasets.
Key Stages:
1. Data Ingestion: Load large-scale text datasets from distributed storage (e.g., HDFS, S3).
2. Text Preprocessing:
o Tokenization: Split text into words or tokens.
o Stopword Removal: Eliminate common, non-informative words.
o Normalization: Convert text to lowercase, remove punctuation.
3. Feature Extraction:
o TF-IDF (Term Frequency-Inverse Document Frequency).
o Word Embeddings (e.g., Word2Vec, GloVe).
4. Model Training:
o Use scalable ML algorithms like Logistic Regression, SVM, or Neural Networks (e.g., LSTMs for NLP
tasks).
5. Evaluation:
o Assess model performance using metrics like accuracy or F1-score.
Example:
For sentiment analysis of customer reviews:
• Input: A dataset with review texts and labels (positive/negative).
• Process:
o Preprocess text (tokenize, normalize, remove stopwords).
o Extract features using TF-IDF.
o Train a Logistic Regression model on distributed systems (e.g., Spark MLlib).
• Output: Predict sentiment for new reviews.
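A hedged sketch of this sentiment-analysis pipeline using Spark MLlib; it assumes a DataFrame named reviews with columns "text" and "label" has already been loaded:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, IDF, StopWordsRemover, Tokenizer}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val remover = new StopWordsRemover().setInputCol("words").setOutputCol("filtered")
val tf = new HashingTF().setInputCol("filtered").setOutputCol("rawFeatures")
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, remover, tf, idf, lr))
val model = pipeline.fit(reviews)           // train on the labelled reviews
val predictions = model.transform(reviews)  // predict sentiment for (new) reviews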
Page Rank Question
Find page rank of all web pages
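The specific web graph for this question is not reproduced here; as a reference, the standard iterative PageRank formula with damping factor d (typically 0.85) is:
PR(p) = \frac{1 - d}{N} + d \sum_{q \in In(p)} \frac{PR(q)}{L(q)}
where N is the total number of pages, In(p) is the set of pages linking to p, and L(q) is the number of outbound links of q; ranks are initialized to 1/N and the formula is iterated until the values converge.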
How many data nodes would be the minimum requirement for allocating these files to the Hadoop
system? Draw the allocation of blocks in the appropriate data nodes.
The minimum number of data nodes required is 3, based on the maximum replication factor of File3 (3 replicas). The
block allocation is as follows:
File Name Blocks Replication Factor Allocation
File1 [1, 2, 3] 1 DN1: [1], DN2: [2], DN3: [3]
File2 [4, 5, 6] 2 DN1: [4, 5], DN2: [4, 6], DN3: [5, 6]
File3 [7, 8, 9] 3 DN1: [7, 8, 9], DN2: [7, 8, 9], DN3: [7, 8, 9]
File4 [10] 2 DN1: [10], DN2: [10]
File5 [11] 1 DN1: [11]
Consider a scenario where you have a list of transactions, where each transaction is represented as
(Transaction ID, Price, Name). Apply filter transformation to identify transactions where the price is
greater than 500, and for the remaining transactions, apply a discount of 10% to the price. Finally, display
the discounted price and Name for each transaction. Write a Scala/pyspark program that performs the
aforementioned operations on the list of transactions.
val transactions = List( ("T001", 600, "Product A"), ("T002", 450, "Product B"), ("T003", 700, "Product C"),
("T004", 800, "Product D"), ("T005", 300, "Product E") )
import org.apache.spark.sql.SparkSession
// List of transactions
val transactions = List(
("T001", 600, "Product A"),
("T002", 450, "Product B"),
("T003", 700, "Product C"),
("T004", 800, "Product D"),
("T005", 300, "Product E")
)
// Convert to DataFrame (requires an active SparkSession and its implicits)
val spark = SparkSession.builder().appName("TransactionDiscount").master("local[*]").getOrCreate()
import spark.implicits._
val transactionDF = transactions.toDF("TransactionID", "Price", "Name")
// Keep transactions with Price > 500 and apply a 10% discount to those prices
val discountedDF = transactionDF
  .filter($"Price" > 500)
  .withColumn("DiscountedPrice", $"Price" * 0.9)
  .select("DiscountedPrice", "Name")
// Display the discounted price and name for each qualifying transaction
discountedDF.show(false)
• JSON (JavaScript Object Notation): a lightweight, human-readable, text-based format that is widely used for
asynchronous browser/server communication.
• XML (Extensible Markup Language): a markup language that is used to store and transport data. XML is
often used for large amounts of structured data and has a wide range of applications.
• YAML (YAML Ain't Markup Language, originally "Yet Another Markup Language"): a human-friendly data
serialization standard for all programming languages. It is often used for configuration files.
• CBOR (Concise Binary Object Representation): a binary data format that is similar to JSON, but more
compact and efficient.
• BSON (Binary JSON): a binary-encoded serialization of JSON-like documents.
• Protocol Buffers (also known as protobuf): a compact binary format developed by Google for high-
performance communication protocols.
1. Apache Avro: Avro is a popular data serialization system that provides a compact binary format and a rich
data structure for big data analytics. Avro supports schema evolution, which allows for changes to the data
structure over time without breaking compatibility.
2. Apache Parquet: Parquet is a columnar storage format that is optimized for big data analytics. It uses a
binary encoding that is more compact and efficient than traditional row-based formats, making it well-suited
for big data analytics.
3. Apache Thrift: Thrift is a data serialization framework that supports efficient serialization and deserialization
of data. It is widely used in big data analytics for its ability to handle complex data structures and support
multiple programming languages.
4. Apache Arrow: Arrow is a high-performance data serialization format that is optimized for big data analytics.
It provides a high-performance binary format for columnar data storage, making it well-suited for analytics
workflows that involve large amounts of data.
Distributed File System
A distributed file system is a type of file system that allows multiple users to access and manage the same data stored
on different nodes or computers in a network. In a distributed file system, the data is stored across multiple nodes in
a way that makes it appear as if it is stored on a single machine, while providing the ability to scale out storage and
processing capabilities as needed.
1. Data replication: The ability to replicate data across multiple nodes to ensure data availability and reliability.
2. Data distribution: The ability to divide the data into smaller chunks and distribute them across multiple
nodes for improved scalability and performance.
3. Data access: The ability for multiple users to access and manipulate the same data from different nodes in a
network.
4. Data management: The ability to manage the data and metadata, such as permissions, ownership, and data
placement, in a coordinated and consistent manner.
1. Hadoop HDFS (Hadoop Distributed File System): A scalable and fault-tolerant file system designed for use
with the Hadoop big data platform.
2. GlusterFS: An open-source, scalable, and highly available distributed file system that can be used on
commodity hardware.
A distributed file system typically provides a number of interfaces that allow users and applications to access and
manage data stored in the system. The most common interfaces in distributed file systems are:
1. File System API: A set of application programming interfaces (APIs) that allow users and applications to
interact with the file system, such as reading and writing files, creating and deleting directories, and
managing metadata.
2. Network File System (NFS) Protocol: A widely used protocol that allows users to access files stored in a
remote file system as if they were stored locally. NFS provides a common set of file operations and is
supported by many operating systems and applications.
3. Server Message Block (SMB) Protocol: A protocol that allows users to access and manage files on a remote
file system using the same operations as they would with a local file system. SMB is commonly used in
Windows-based environments.
4. Object Storage Interface: A set of APIs that provide a way to store and retrieve unstructured data as objects,
rather than as traditional files and directories. Object storage is commonly used in cloud-based storage
environments.
5. Distributed File System Protocol (DFS): A protocol that provides a unified namespace for a set of distributed
file systems, allowing users to access files stored on different nodes as if they were stored in a single file
system.
Data Ingest
Data ingest is the process of bringing data into a data processing or storage system for further analysis and
processing. Data ingest can involve multiple steps, including data acquisition, data extraction, data transformation,
and data loading.
Data ingest is a critical step in many big data analytics pipelines and is typically performed in a scalable and efficient
manner to handle the volume and velocity of the data being ingested. The choice of data ingest tools and
technologies depends on the specific requirements of the data processing and storage system, such as the type and
format of the data, the scale of the data, and the performance requirements of the system.
Data ingest with Flume and Sqoop: Flume and Sqoop are two popular tools for data ingest in big data analytics.
1. Flume is an open-source, distributed data collection and ingestion framework that is designed to make it easy to
move large volumes of data into Hadoop for further processing and analysis. Flume supports a variety of sources,
including log files, databases, and network sockets, and provides a scalable and reliable mechanism for
transmitting data to a Hadoop cluster. Flume is also extensible and can be configured to handle complex data
processing and transformation requirements.
2. Sqoop is a tool for data ingestion and data export in Hadoop. It is used to transfer data between Hadoop and
other systems such as relational databases, NoSQL databases, and other data stores. Sqoop provides a high-level
command line interface that makes it easy to perform common data transfer tasks, such as loading data into
Hadoop and exporting data from Hadoop. Sqoop also supports parallel data transfer and can be used to perform
data ingestion at scale.
Spark Storage Level and Cache Persistence
Spark storage levels and cache persistence are critical for optimizing the performance of Spark applications. These
mechanisms enable the reuse of intermediate computation results, reducing recomputation and improving
efficiency.
Storage Levels in Spark
1. Disk Storage:
o Provides fault-tolerance.
o Suitable for large datasets.
2. Memory Storage:
o Enables faster access to frequently accessed datasets.
3. Off-Heap Storage:
o Stores data outside the JVM heap.
o Prevents impact on garbage collection processes.
Spark Cache vs Persist
• cache():
o Default storage level: MEMORY_ONLY.
o Used for intermediate computations of RDDs, DataFrames, or Datasets.
• persist():
o Allows specifying storage levels like memory, disk, or both.
o Ensures fault-tolerance; lost partitions are recomputed using original transformations.
Advantages of Caching and Persistence
1. Cost Efficiency: Reduces the cost of repeated computations.
2. Time Efficiency: Speeds up jobs by avoiding recomputation.
3. Execution Time Optimization: Frees up resources for more jobs on the same cluster.
Examples
1. Caching Syntax:
val dfCache = df.cache()
dfCache.show(false)
2. Persist Syntax (requires import org.apache.spark.storage.StorageLevel):
val dfPersist = df.persist(StorageLevel.MEMORY_AND_DISK)
dfPersist.show(false)
3. Unpersist Syntax:
dfPersist.unpersist()
Spark Architecture
Spark follows a master-slave architecture consisting of several key components:
1. Driver: The main program that defines transformations and actions on data. It coordinates and schedules
tasks.
2. Cluster Manager: Manages the cluster resources and allocates them to applications. Examples include
YARN, Mesos, or a standalone manager.
3. Executors: Workers that run tasks and store data. Each node in the cluster runs an executor.
4. Tasks: Individual units of work sent to the executors.
5. RDD (Resilient Distributed Dataset): A distributed collection of data that is fault-tolerant and can be
operated on in parallel.
Resilient Distributed Datasets (RDDs)
RDDs are the core abstraction in Apache Spark, representing a distributed collection of data that is fault-tolerant
and can be processed in parallel across a cluster. They allow efficient and large-scale data processing, supporting
both in-memory and disk storage.
Key Features of RDDs:
1. Fault Tolerance: RDDs can recover data lost due to failures using lineage information.
2. Parallel Processing: Operations are distributed across multiple nodes in a cluster.
3. Lazy Evaluation: RDD operations are only executed when an action is triggered.
Operations on RDDs
The two major types of operations available are transformations and actions.
1. Transformations:
o Return a new, modified RDD based on the original.
o Common transformations include:
▪ map()
▪ filter()
▪ sample()
▪ union()
2. Actions:
o Return a value based on some computation performed on an RDD.
o Common actions include:
▪ reduce()
▪ count()
▪ first()
▪ foreach()
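A brief hedged sketch showing that transformations are lazy and only an action triggers execution (assumes an existing SparkContext sc):
val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5))
val evens = rdd.filter(_ % 2 == 0)           // transformation: nothing executes yet
val combined = evens.union(rdd.map(_ * 10))  // still lazy
println(combined.count())                    // action: the whole lineage runs now -> 7
println(combined.first())                    // another action -> 2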
Iterative Operations on MapReduce
MapReduce lacks efficient support for iterative operations since it writes intermediate results to disk after each
Map and Reduce phase. This is inefficient for algorithms that require multiple passes over the same data (e.g.,
machine learning algorithms), leading to high latency and slower processing times.
Spark Ecosystem
The Spark ecosystem consists of several components and libraries that extend Spark’s capabilities for big data
processing:
1. Spark Core: The foundation that provides basic functionalities like task scheduling and memory
management.
2. Spark SQL: Enables querying of structured data using SQL and DataFrame API.
3. Spark Streaming: Allows real-time stream processing.
4. MLlib: A machine learning library for scalable algorithms like classification and clustering.
5. GraphX: Spark’s API for graph processing and graph-parallel computations.
6. SparkR: R language integration for statistical computing.
Spark vs Hadoop (PYQ)
Feature | Apache Spark | Apache Hadoop (MapReduce)
Processing Model | In-memory computing | Disk-based, batch processing
Speed | Faster (due to in-memory processing) | Slower (due to disk I/O for intermediate data)
Ease of Use | Supports APIs in Scala, Java, Python, R; high-level APIs (DataFrames, SQL) | Java-based, lower-level MapReduce programming
Data Processing Type | Batch, real-time (streaming), interactive, and iterative | Primarily batch processing
Fault Tolerance | Uses lineage and DAG to recompute lost data | Uses data replication across nodes for fault tolerance
Data Storage | Can work with various data sources (HDFS, S3, HBase, etc.) | Works mainly with HDFS (Hadoop Distributed File System)
Latency | Low-latency (due to in-memory processing) | High-latency (disk I/O for every MapReduce step)
Streaming Support | Supports real-time stream processing (via Spark Streaming) | Does not support real-time streaming
Machine Learning | Includes MLlib for machine learning tasks | Requires third-party libraries (e.g., Mahout)
Graph Processing | Provides GraphX for graph computations | Lacks built-in graph processing capabilities
Compatibility | Compatible with Hadoop ecosystem (can run on YARN, access HDFS) | Runs only in Hadoop ecosystem (HDFS, YARN)
Resource Management | Can use YARN, Mesos, or standalone cluster manager | Relies on YARN for resource management
Iterative Algorithms | Optimized for iterative algorithms (e.g., ML algorithms) | Not optimized for iterative algorithms (requires multiple MapReduce jobs)
Maturity | Newer, but rapidly growing in popularity | Older, more stable, and widely used
Spark Scheduler
The Spark scheduler handles the execution of jobs by dividing them into stages and tasks. It uses a Dryad-like
Directed Acyclic Graph (DAG) to represent job execution, where nodes are operations (transformations) and
edges represent dependencies.
Key Features of the Spark Scheduler:
1. Pipelines functions within a stage: It groups operations into stages, executing pipelined functions (e.g.,
map, filter) without waiting.
2. Cache-aware work reuse & locality: It optimizes execution by reusing cached data and scheduling tasks
close to where data is stored.
3. Partitioning-aware: To avoid expensive data shuffling, it keeps track of data partitioning.
Example of Stages:
• Stage 1: Operations like groupBy, map (A, B, C, D) are executed.
• Stage 2: Combines transformations like union and join.
• Stage 3: Uses cached data partitions to minimize recomputation.
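A hedged sketch of how this plays out in code: narrow transformations are pipelined into one stage, and the shuffle introduced by reduceByKey starts a new stage (the HDFS path is a placeholder, and an existing SparkContext sc is assumed):
val words = sc.textFile("hdfs://namenode:9000/data/words.txt")
  .flatMap(_.split(" "))
  .map((_, 1))                          // textFile + flatMap + map pipeline into a single stage
val counts = words.reduceByKey(_ + _)   // wide (shuffle) dependency begins a new stage
counts.collect()                        // the action submits both stages to the DAG scheduler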
Hadoop Ecosystem (PYQ)
• Data Storage:
o HDFS: Distributed file system for storing large files.
o HBase: Columnar database for real-time access to large datasets.
• Data Processing:
o MapReduce: Parallel processing framework for handling large-scale data.
o YARN: Manages cluster resources and job scheduling.
• Data Access:
o Hive: SQL-like interface for querying data in HDFS.
o Pig: Data flow scripting for processing large datasets.
o Mahout: Machine learning library for scalable algorithms.
o Avro: Framework for data serialization and RPC.
o Sqoop: Connects and imports data between Hadoop and relational databases.
• Data Management:
o Oozie: Workflow scheduler for managing Hadoop jobs.
o Chukwa: System for monitoring and collecting data.
o Flume: Collects and aggregates log data from various sources.
o ZooKeeper: Coordination and management for distributed applications.
Pig
Apache Pig is a high-level platform for processing large datasets in Hadoop. It uses a language called Pig Latin for
expressing data transformations. Pig simplifies coding in MapReduce by providing a more accessible scripting
interface.
Why do we need Pig?
• Simplified MapReduce: Writing MapReduce directly is complex; Pig provides a more intuitive approach.
• Data Transformation: Useful for tasks like filtering, grouping, and joining large datasets.
• Less Code: With Pig, operations are concise and easier to maintain.
• Extensibility: Supports user-defined functions (UDFs) for custom tasks.
Features of Pig
• Ease of Programming: Pig Latin is easier than raw MapReduce.
• Data Flow Language: Describes transformations as a data flow.
• Schema Flexibility: Works with both structured and unstructured data.
• Optimization: Automatically optimizes execution by generating efficient MapReduce code.
• Extensibility: Supports UDFs in multiple languages (Java, Python).
Applications of Pig
• Log Analysis: Analyze and process web server logs.
• Data Processing: ETL (Extract, Transform, Load) tasks for large datasets.
• Ad Targeting: For marketing data, processing user behavior data.
• Data Research: Quick prototyping of algorithms in big data analytics.
Apache Pig Architecture
Apache Pig’s architecture is designed to execute Pig Latin scripts efficiently over large datasets. The key components
of Pig's architecture include:
1. Pig Latin Script: The user writes a Pig Latin script to specify data transformations.
2. Parser: Converts the Pig Latin script into a logical plan (a series of steps representing the data flow) after
checking syntax and type.
3. Optimizer: Optimizes the logical plan for better performance, generating an optimized logical plan.
4. Compiler: Converts the optimized logical plan into a physical plan of MapReduce jobs.
5. Execution Engine: This executes the physical plan as MapReduce jobs on a Hadoop cluster.
6. HDFS (Hadoop Distributed File System): The data storage and retrieval system, where Pig processes the
data.
Pig vs MapReduce
Feature Pig MapReduce
Language Pig Latin (high-level scripting) Java (low-level programming)
Ease of Use Easier with fewer lines of code Complex and requires more code
Abstraction Higher level; abstracts MapReduce Low level; direct MapReduce coding
Development Speed Faster for developers Slower; requires detailed coding
Optimization Automatically optimized Manual optimization needed
Use Case For ETL, data analysis, and querying Best for complex and custom operations
Pig vs SQL
Feature Pig SQL
Data Type Support Supports both structured and unstructured data Primarily structured (RDBMS)
Data Processing Procedural, step-by-step data flow Declarative, focus on "what" to retrieve
Schema Requirement Can work with or without schema Requires predefined schema
Platform Designed for Hadoop (Big Data) Designed for RDBMS (Relational Databases)
Flexibility More flexible with unstructured data Limited to structured data
Language Pig Latin (procedural) SQL (declarative)
Pig vs Hive
Feature Pig Hive
Language Pig Latin (procedural) HiveQL (SQL-like, declarative)
Data Handling Works with unstructured and structured data Primarily for structured data
Use Case ETL, data processing, and analysis Data querying and reporting
Execution Translates scripts to MapReduce jobs Also translates HiveQL into MapReduce jobs
Learning Curve Easier for programmers Easier for SQL users
Optimization Automatic but procedural control Query optimization via SQL-based execution plan
Hive Architecture
1. User Interfaces
• Web UI: Web-based interaction.
• CLI: Command Line Interface for executing HiveQL queries.
• HDInsight: Cloud-based interface on Azure for Hive.
2. Meta Store
• Stores metadata like table schemas, partitions, and data locations.
• Uses RDBMS (e.g., MySQL) for managing metadata.
3. HiveQL Process Engine
• Parsing: Converts HiveQL queries into a logical plan.
• Optimization: Optimizes query execution using metadata.
4. Execution Engine
• Executes the optimized query using MapReduce, Tez, or Spark based on the configuration.
5. MapReduce
• Hive translates queries into MapReduce jobs for distributed data processing.
6. HDFS or HBase Data Storage
• HDFS: Default data storage for Hive.
• HBase: Supports NoSQL-style data storage for real-time access.
Big Data
Big Data is a term for extremely large and complex datasets that traditional data processing tools can't handle
efficiently. It involves data from diverse sources and is characterized by its vast scale, rapid growth, and varying
formats.
5 V's of Big Data
1. Volume: Amount of data.
2. Velocity: Speed of data generation.
3. Variety: Types of data.
4. Veracity: Data accuracy.
5. Value: Insights and benefits.