Msbte UT 1 QB Answers

2 Marks

Q.1 State features of Hadoop?


 Hadoop is an open-source framework designed for
storing and processing large datasets in a distributed
computing environment.
Features of Hadoop:-
1. Distributed Storage – Hadoop uses HDFS
(Hadoop Distributed File System) to store large
datasets across multiple nodes, ensuring high
availability and fault tolerance.
2. Scalability – It can scale horizontally by adding
more nodes to handle increasing amounts of data
efficiently.
3. Fault Tolerance – Hadoop automatically replicates
data across multiple nodes, ensuring data safety
even if a node fails.
4. Parallel Processing – The MapReduce framework enables the parallel processing of large datasets, improving processing speed and efficiency (sketched below).
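To make point 4 concrete, here is a minimal, self-contained Python simulation of the MapReduce word-count pattern. It runs the map, shuffle, and reduce phases in one process purely to show the data flow; real Hadoop distributes these phases across nodes, and the sample input lines are made up.

```python
# A minimal, self-contained simulation of the MapReduce word-count pattern.
# In real Hadoop the map and reduce phases run in parallel across nodes;
# here they run in one process purely to illustrate the data flow.
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped values (here, sum the counts).
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["hadoop stores big data", "hadoop processes big data in parallel"]
print(reduce_phase(shuffle(map_phase(lines))))
# {'hadoop': 2, 'stores': 1, 'big': 2, 'data': 2, 'processes': 1, 'in': 1, 'parallel': 1}
```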
Q.2 List domain-specific features of Hadoop
 Hadoop is an open-source framework designed for storing and processing large datasets in a distributed computing environment.
Domain-specific features of Hadoop are as follows:-
1. Healthcare – Processes large volumes of medical
records, genomic data, and real-time patient
monitoring data for predictive analytics and
disease detection.
2. Finance – Enables fraud detection, risk
management, and real-time transaction analysis by
handling vast amounts of financial data efficiently.
3. E-commerce – Supports recommendation
systems, customer behaviour analysis, and
inventory management by processing big data
from user interactions.
4. Telecommunications – Helps in network
optimization, call detail record analysis, and
predictive maintenance by analyzing large-scale
data traffic patterns.

Q.3 Define BDA and state its types


 Big Data Analytics (BDA) is the process of
examining large and complex datasets to uncover
hidden patterns, correlations, trends, and insights for
decision-making.
Types of BDA:-
1. Descriptive Analytics – Summarizes historical data to understand past trends.
2. Diagnostic Analytics – Identifies causes of past events using data patterns.
3. Predictive Analytics – Uses statistical models and machine learning to forecast future outcomes (contrasted with descriptive analytics in the toy example after this list).
4. Prescriptive Analytics – Recommends actions based on predictive insights to optimize decision-making.
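As a toy contrast between types 1 and 3 above, the following Python sketch computes a descriptive summary (a mean) and a simple predictive forecast (a hand-fitted least-squares trend) on made-up monthly sales figures; it is an illustration, not a production forecasting method.

```python
# A toy illustration of descriptive vs. predictive analytics on made-up
# monthly sales figures (all numbers are assumptions for illustration).
from statistics import mean

sales = [100, 110, 125, 135, 150, 160]  # units sold in months 1..6

# Descriptive: summarize what already happened.
print("average monthly sales:", mean(sales))

# Predictive: fit a least-squares line y = a + b*x and forecast month 7.
n = len(sales)
xs = range(1, n + 1)
b = (n * sum(x * y for x, y in zip(xs, sales)) - sum(xs) * sum(sales)) / \
    (n * sum(x * x for x in xs) - sum(xs) ** 2)
a = mean(sales) - b * mean(xs)
print("forecast for month 7:", round(a + b * 7, 1))  # -> 173.0
```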
Q.4 State different big data stacks
 Here are four different Big Data Stacks:
Hadoop Ecosystem Stack – Includes Hadoop
Distributed File System (HDFS), MapReduce, YARN, and
tools like Hive, Pig, and HBase for big data processing
and storage.
Lambda Architecture Stack – Combines batch processing (Hadoop, Spark) with real-time processing (Apache Storm, Kafka, Flink) for scalable and fault-tolerant data analytics (see the sketch after this list).
Kappa Architecture Stack – Focuses on real-time
data processing using tools like Apache Kafka, Apache
Flink, and Apache Samza, eliminating batch layers.
SMACK Stack – Consists of Spark, Mesos, Akka,
Cassandra, and Kafka, providing a real-time, scalable,
and high-performance big data processing solution.
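Here is a minimal Python sketch of the Lambda architecture's serving idea referenced above: a query merges a complete but slow batch view with a fresh real-time view. The page names and counts are assumptions for illustration.

```python
# A minimal sketch of the Lambda architecture's core idea: serve queries by
# merging a (slow, complete) batch view with a (fast, recent) real-time view.
# The event names and counts are assumptions for illustration.

batch_view = {"page_a": 10_000, "page_b": 7_500}  # recomputed periodically by Hadoop/Spark
realtime_view = {"page_a": 42, "page_c": 5}       # updated continuously by a stream processor

def query(page):
    # Serving layer: combine both views so results are complete AND fresh.
    return batch_view.get(page, 0) + realtime_view.get(page, 0)

for page in ("page_a", "page_b", "page_c"):
    print(page, query(page))
```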

Q.5 State characteristics of big data


 Here are four key characteristics of Big Data:
1. Volume – Refers to the vast amount of data
generated from various sources, such as social
media, sensors, and transactions.
2. Velocity – Represents the speed at which data is
generated, processed, and analyzed in real-time or
near real-time.
3. Variety – Includes different types of data formats, such as structured (databases), semi-structured (XML, JSON), and unstructured (videos, images, text); all three are illustrated after this list.
4. Veracity – Ensures the reliability and accuracy of
data, dealing with inconsistencies, noise, and
uncertainty in big data sources.
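As a toy illustration of Variety (item 3 above), the following Python snippet handles the same kind of fact arriving in structured, semi-structured, and unstructured form; all sample values are made up.

```python
# A toy look at "Variety": the same kind of fact arriving as structured,
# semi-structured, and unstructured data (all sample values are made up).
import csv, io, json

structured = io.StringIO("user,amount\nalice,120\n")   # CSV row (table-like)
semi_structured = '{"user": "bob", "amount": 80}'      # JSON document
unstructured = "carol paid 95 dollars for her order"   # free text

print(next(csv.DictReader(structured)))  # parsed against a fixed schema
print(json.loads(semi_structured))       # self-describing, flexible schema
print(unstructured.split())              # needs NLP/heuristics to extract meaning
```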

Q.6 Evolution of Hadoop?



1. Origins (2003–2006): Inspired by Google’s GFS and MapReduce papers, Hadoop was created by Doug Cutting and Mike Cafarella and became an Apache project in 2006.
2. Growth (2008–2013): Became a top-level Apache project and gained wide industry adoption; alongside HDFS and MapReduce, Hadoop 2 (2013) introduced YARN for better scalability.
3. Advancements (2017–Present): Hadoop 3.0
added erasure coding, container support, and
cloud integration, improving performance and
efficiency.
4. Future Trends: Focus on AI, real-time
processing, hybrid cloud, and security to
enhance big data analytics.

Q.7 Architecture of Hadoop (draw and list)


 The Hadoop architecture mainly consists of four components:
 MapReduce
 HDFS (Hadoop Distributed File System)
 YARN (Yet Another Resource Negotiator)
 Common Utilities (Hadoop Common)
Q.8 Types of analysis:-
1. Descriptive Analytics – Summarizes historical data to understand past trends.
2. Diagnostic Analytics – Identifies causes of past events using data patterns.
3. Predictive Analytics – Uses statistical models
and machine learning to forecast future outcomes.
4. Prescriptive Analytics – Recommends actions
based on predictive insights to optimize decision-
making.

4 Marks:-
Q.1 Explain data science?
Data science is an interdisciplinary field that involves
using various techniques, algorithms, and systems to
analyze and interpret large sets of data to derive
insights and make informed decisions.
Here are 8 key points about data science:
1. Data Collection and Cleaning: Before any
analysis, data needs to be gathered from different
sources and cleaned to ensure accuracy. Raw data
often contains errors, missing values, or
inconsistencies that need to be addressed.
2. Exploratory Data Analysis (EDA): EDA involves
visualizing and summarizing data to understand
patterns, distributions, and relationships. This step
helps data scientists to uncover hidden insights
and decide on further analysis.
3. Statistical Analysis: Data science heavily relies
on statistics to make inferences, test hypotheses,
and understand the likelihood of certain events or
outcomes. This includes tools like regression
analysis, probability, and hypothesis testing.
4. Machine Learning: Machine learning (ML)
algorithms allow computers to learn from data,
identifying patterns without being explicitly
programmed. Common techniques include
classification, regression, clustering, and decision
trees.
5. Big Data Technologies: Handling vast amounts
of data requires specialized tools like Hadoop,
Spark, and cloud computing resources. These tools
allow for processing, storing, and analyzing data
that exceeds traditional computing power.
6. Data Visualization: Visual representations of
data, such as graphs, charts, and dashboards, are
used to communicate findings clearly. Visualization
helps stakeholders easily interpret complex data
insights.
7. Predictive Modeling: One of the primary goals of
data science is to predict future outcomes based
on historical data. This involves building models
that can forecast trends, behaviors, or risks with a
certain level of confidence (a compact sketch of this workflow follows the list).
8. Communication and Decision-Making: Data
science isn't just about analyzing data—it’s also
about communicating the findings to non-technical
stakeholders. Data scientists must be able to
explain their results clearly and help guide
business decisions.
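A compact sketch tying several of these points together (collection/cleaning, EDA, statistics, and predictive modeling), assuming pandas and scikit-learn are installed and using made-up advertising data:

```python
# A compact sketch of the data science workflow above on made-up
# advertising data (all values are assumptions for illustration).
import pandas as pd
from sklearn.linear_model import LinearRegression

# 1-2. Collect and clean: drop the row with a missing value.
df = pd.DataFrame({"ad_spend": [10, 20, 30, None, 50],
                   "sales":    [25, 45, 65, 80, 105]}).dropna()

# 3. EDA / statistics: summarize and check the relationship.
print(df.describe())
print("correlation:", df["ad_spend"].corr(df["sales"]))

# 4-7. Model and predict: fit a regression, forecast sales for a new spend.
model = LinearRegression().fit(df[["ad_spend"]], df["sales"])
print("predicted sales at spend=60:",
      model.predict(pd.DataFrame({"ad_spend": [60]}))[0])
```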
Q.2 Explain the analytics flow of big data?
The analytics flow for big data involves several key
stages:
a. Data Collection: Gather data from diverse sources
like sensors, social media, and logs.
b. Data Storage: Store large datasets in scalable
solutions like Hadoop or NoSQL databases.
c. Data Cleaning and Preprocessing: Clean and
transform data to ensure quality and usability.
d. Data Analysis: Conduct exploratory analysis and
summarize trends or patterns using descriptive
analytics.
e. Modeling and Machine Learning: Apply machine
learning techniques to build models that predict or
classify data.
f. Data Visualization and Reporting: Visualize insights
through dashboards and reports for better
decision-making.
g. Deployment and Integration: Deploy models into
production and integrate insights into business
systems.
h. Monitoring and Maintenance: Continuously monitor
models and data pipelines for accuracy and
performance.
This flow ensures that big data is processed, analyzed, and turned into actionable insights effectively; the minimal sketch below wires several of these stages together.
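Here is a minimal plain-Python sketch of the flow, with one function per stage (cleaning, analysis, reporting) chained into a pipeline; the sensor records are assumptions for illustration.

```python
# A minimal end-to-end sketch of the analytics flow; each stage is a
# function, and the raw sensor records are made up for illustration.
raw = [{"sensor": "s1", "temp": 21.5}, {"sensor": "s2", "temp": None},
       {"sensor": "s1", "temp": 22.1}, {"sensor": "s3", "temp": 19.8}]

def clean(records):                 # stage c: drop unusable records
    return [r for r in records if r["temp"] is not None]

def analyze(records):               # stage d: descriptive summary
    temps = [r["temp"] for r in records]
    return {"count": len(temps), "avg_temp": sum(temps) / len(temps)}

def report(summary):                # stage f: present the insight
    print(f"{summary['count']} readings, average {summary['avg_temp']:.1f} °C")

report(analyze(clean(raw)))         # the stages wired together as one pipeline
```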
Q.3 Explain the data collection process of big data analytics with an example
Big data collection for analytics involves these key
steps:
1. Identify & Select: Define project goals and choose
relevant data sources (e.g., databases, social
media).
2. Acquire: Gather data using methods like web scraping, APIs, or data streaming (sketched below).
3. Store: Use distributed systems (e.g., Hadoop) for
secure and efficient storage.
4. Preprocess: Clean and transform data, handling
missing values and inconsistencies.
5. Govern: Implement policies for data access,
security, and privacy, ensuring compliance.
A retail example: a company wanting to analyze
customer behavior would collect data from
databases, websites, and social media, store it in
Hadoop, clean it, and then analyze it while adhering
to privacy regulations.
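Below is a hedged Python sketch of step 2 (Acquire) pulling JSON records from a REST API. The endpoint, parameters, and response shape are hypothetical placeholders to be replaced with a real source, and the requests library must be installed (pip install requests).

```python
# A hedged sketch of the "Acquire" step: pulling JSON records from a REST API.
# Endpoint, parameters, and response shape are hypothetical placeholders.
import requests

def fetch_reviews(page):
    resp = requests.get("https://api.example.com/reviews",  # hypothetical endpoint
                        params={"page": page, "per_page": 100},
                        timeout=10)
    resp.raise_for_status()   # fail loudly on HTTP errors
    return resp.json()        # assumed: a list of review dicts

# Paginate until the source runs dry, then hand records to storage/cleaning.
records = []
page = 1
while True:
    batch = fetch_reviews(page)
    if not batch:
        break
    records.extend(batch)
    page += 1
```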

Q.4 Describe HDFS.


HDFS (Hadoop Distributed File System) is a
distributed file system designed for large clusters and
high throughput. Key characteristics include:
 Scalability: Handles massive files (GBs to TBs) by
breaking them into blocks and distributing them
across thousands of nodes.
 Replication: Ensures fault tolerance by replicating data blocks across multiple machines (the default replication factor is 3; the default block size is 128 MB in Hadoop 2+, 64 MB in Hadoop 1). The toy sketch below simulates this placement.
 Streaming Access: Optimized for high-
throughput sequential reads and writes, ideal for
batch processing. This trade-off means it's not
suitable for low-latency, interactive access.
 File Appends: While initially designed for
immutable files (write-once), HDFS now supports
appending to files.
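To make the block-and-replica idea concrete, here is a toy Python simulation of HDFS-style placement; the tiny block size and node names are scaled-down assumptions (real HDFS uses 128 MB blocks and a replication factor of 3 by default).

```python
# A toy simulation of HDFS's core storage idea: split a file into fixed-size
# blocks and place each block on several nodes. Block size and node names are
# scaled-down assumptions for illustration.
from itertools import cycle

BLOCK_SIZE = 4        # bytes here; 128 MB in Hadoop 2+ (64 MB in Hadoop 1)
REPLICATION = 3
nodes = ["node1", "node2", "node3", "node4"]

data = b"hello hadoop distributed file system"
blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

placement = {}
node_ring = cycle(nodes)
for idx, block in enumerate(blocks):
    # Each block goes to REPLICATION distinct nodes, so losing one node
    # still leaves two live copies (fault tolerance).
    placement[idx] = [next(node_ring) for _ in range(REPLICATION)]

for idx, replicas in placement.items():
    print(f"block {idx} ({blocks[idx]!r}) -> {replicas}")
```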
Q.5 Compare RDBMS vs Hadoop

Feature     | RDBMS                                              | Hadoop
Data Type   | Primarily structured data (rows and columns)       | Handles structured, semi-structured, and unstructured data
Data Volume | Designed for smaller to medium-sized datasets      | Optimized for massive datasets (Big Data)
Processing  | Focuses on online transaction processing (OLTP) with fast, consistent transactions | Emphasizes batch processing for complex analysis of large volumes of data
Scalability | Typically scales vertically (more powerful hardware) | Scales horizontally (add more machines to the cluster)
Data Schema | Static schema defined upfront                      | Flexible schema, often schema-on-read

Q.6 Case study on weather forecasting



Weather forecasting is a complex process that relies
on analyzing vast amounts of data from various
sources. Big data analytics has revolutionized this
field, enabling more accurate and timely predictions.
Here's a case study on weather forecasting using big
data:
Data Sources:
 Weather Stations: Collect real-time data on
temperature, humidity, wind speed, and
precipitation.
 Satellites: Provide images and data on cloud
cover, atmospheric conditions, and land surface
temperatures.
 Radar: Detects precipitation and tracks its
movement.
 Aircraft: Gather data on atmospheric conditions at
different altitudes.
 IoT Sensors: A growing network of sensors
provides hyperlocal data on weather conditions.
 Social Media: Can offer insights into real-time
weather events and their impact.
Big Data Technologies:
 Hadoop: Stores and processes massive datasets
from diverse sources.
 Spark: Enables fast and efficient analysis of
weather data.
 Machine Learning: Algorithms identify patterns
and build predictive models.
 Cloud Computing: Provides scalable
infrastructure for data storage and processing.
Analysis and Prediction:
 Data Preprocessing: Cleaning and transforming
raw data to ensure quality.
 Feature Engineering: Extracting relevant
variables for model building.
 Model Development: Training machine learning models on historical and real-time data (see the sketch after this list).
 Visualization: Presenting weather forecasts
through maps, charts, and reports.
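As a minimal sketch of the Model Development step, the following snippet trains a linear regression on made-up historical readings to predict the next day's temperature; real forecasting combines physics-based numerical models with far richer data, and scikit-learn is assumed installed.

```python
# A minimal sketch of the "Model Development" step: fit a regression on
# (made-up) historical readings to predict the next day's temperature.
from sklearn.linear_model import LinearRegression

# Features: [today's temp (°C), humidity (%), pressure (hPa)];
# target: tomorrow's temperature (°C). All values are invented.
X = [[20, 60, 1012], [22, 55, 1010], [18, 80, 1005],
     [25, 40, 1015], [16, 85, 1002], [23, 50, 1013]]
y = [21, 23, 17, 26, 15, 24]

model = LinearRegression().fit(X, y)
print("forecast:", round(model.predict([[21, 65, 1008]])[0], 1), "°C")
```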
Benefits:
 Improved Accuracy: Big data enhances the
precision of weather forecasts.
 Timely Warnings: Enables early warnings for
severe weather events.
 Enhanced Decision-Making: Helps individuals,
businesses, and governments make informed
choices.
 Better Resource Management: Optimizes
resource allocation for weather-related activities.
Challenges:
 Data Volume and Velocity: Handling the sheer
volume and speed of weather data.
 Data Integration: Combining data from diverse
sources with varying formats.
 Model Complexity: Developing accurate and
reliable predictive models.
 Computational Resources: Requires powerful
computing infrastructure for data processing.

Q.7 Challenges of big data


1. Data Volume and Velocity: The sheer amount of
data being generated is exploding, and it's coming
in faster than ever. This makes it difficult to store,
process, and analyze it all efficiently. Think of
trying to drink from a firehose!
2. Data Variety: Big data comes in all shapes and
sizes - structured (like a spreadsheet), semi-
structured (like a document with some tags), and
unstructured (like social media posts or videos).
This variety makes it tough to integrate and
analyze data from different sources.
3. Data Quality: With so much data, it's easy for errors, inconsistencies, and duplicates to creep in. Poor data quality can lead to inaccurate insights and bad decisions; it's like trying to build a house with faulty materials (a small cleaning sketch follows this list).
4. Data Security and Privacy: Big data often
contains sensitive information, so protecting it from
breaches and misuse is crucial. Companies need to
comply with privacy regulations and ensure data is
accessed and used responsibly.
5. Skills Gap: Analyzing big data requires specialized
skills in areas like data science, machine learning,
and statistics. Finding and retaining professionals
with these skills can be a challenge.
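As a small sketch of tackling the data-quality challenge (point 3 above), the following snippet deduplicates records and imputes a missing value with pandas; the sample records are made up, and pandas is assumed installed.

```python
# A small sketch of addressing data quality with pandas:
# remove duplicate rows and impute a missing value (sample records are made up).
import pandas as pd

df = pd.DataFrame({"customer": ["alice", "alice", "bob", "carol"],
                   "age":      [34, 34, None, 29],
                   "city":     ["Pune", "Pune", "Mumbai", "Nashik"]})

df = df.drop_duplicates()                         # remove exact duplicate rows
df["age"] = df["age"].fillna(df["age"].median())  # impute the missing age
print(df)
```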
