BDA Answers
Q1: Define Big Data. Why is Big Data required? How does traditional BI environment
differ from Big Data environment?
Big Data refers to extremely large datasets that may be analyzed computationally to reveal
patterns, trends, and associations, especially relating to human behavior and interactions. Big
Data is defined by the 3Vs: Volume, Velocity, and Variety.
Q3: Define Big Data. Why is Big Data required? Write a note on data warehouse
environment.
Q4: What are the three characteristics of Big Data? Explain the differences between BI
and Data Science.
BI vs Data Science:
Q6: What are the key roles for the New Big Data Ecosystem?
Q8: What is Big Data Analytics? Explain in detail with its example. Big Data Analytics
refers to the process of examining large datasets to uncover hidden patterns, correlations,
market trends, and customer preferences. Example: Netflix uses Big Data to analyze user
preferences and recommend content accordingly.
Q11: Write a short note on data science and data science process. Data Science:
Interdisciplinary field that uses algorithms, data analysis, and ML to extract insights.
Process:
1. Business understanding
2. Data collection
3. Data preparation
4. Modeling
5. Evaluation
6. Deployment
7. Monitoring
Soft State: State of the system may change over time without input.
Eventual Consistency: In distributed systems, data will become consistent over time.
Example: In Amazon DynamoDB, updates propagate eventually to ensure all replicas
converge.
Q13: What are different phases of the Data Analytics Lifecycle? Explain each in detail.
UNIT II
Q1: Explain Analytical Theory and Methods Analytical theory and methods refer to
mathematical and statistical techniques used to extract insights from data. They provide the
foundation for algorithms in machine learning and predictive modeling. Examples include
regression, classification, clustering, and time series analysis.
Steps:
1. Select K initial centroids randomly
2. Assign each data point to the nearest centroid
3. Recalculate centroids as the mean of points in each cluster
4. Repeat steps 2 and 3 until convergence
Applications: Customer segmentation, image compression, pattern recognition
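A minimal sketch of these steps in Python using scikit-learn (the 2-D points below are made-up example data):
# K-means on a small, made-up 2-D dataset
import numpy as np
from sklearn.cluster import KMeans
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)   # choose K = 2 centroids
labels = kmeans.fit_predict(X)       # assign points and recompute centroids until convergence
print(labels)                        # cluster index of each point
print(kmeans.cluster_centers_)       # final centroids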
Q3: Write a short note on Diagnostics Diagnostics are tools and techniques used to
evaluate model performance. They help identify model assumptions, residual errors,
multicollinearity, and overfitting. Examples include confusion matrix, ROC curve, and
residual plots.
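A minimal sketch of two of these diagnostics in Python with scikit-learn (the labels and scores below are made up):
from sklearn.metrics import confusion_matrix, roc_auc_score
y_true = [0, 0, 1, 1, 1, 0]                 # made-up actual labels
y_pred = [0, 1, 1, 1, 0, 0]                 # made-up predicted labels
y_score = [0.2, 0.6, 0.8, 0.9, 0.4, 0.1]    # made-up predicted probabilities
print(confusion_matrix(y_true, y_pred))     # counts of correct and incorrect predictions
print(roc_auc_score(y_true, y_score))       # area under the ROC curve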
Q4: Explain Units of Measure Units of measure indicate the scale or quantity in which data
is represented (e.g., dollars, seconds, kilograms). Ensuring consistency in units is critical for
accurate analysis and comparisons.
Data quality
Feature selection
Handling missing values
Model interpretability
Ethical and legal concerns
Q6: What are the Additional Algorithms Apart from basic methods, additional algorithms
include:
Random Forest
Gradient Boosting
Support Vector Machines (SVM)
Neural Networks
K-Nearest Neighbors (KNN)
Q7: Write a short note on linear regression model. Also apply Ordinary Least Squares
(OLS) technique to estimate the parameters. Linear regression models the relationship
between a dependent variable and one or more independent variables using a straight line.
Model: y = β0 + β1x + ε
OLS estimates β0 and β1 by minimizing the sum of squared errors (SSE):
β1 = Σ((xi - x̄)(yi - ȳ)) / Σ(xi - x̄)²
β0 = ȳ - β1x̄
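A minimal sketch of these formulas in Python with NumPy (x and y are made-up sample data):
import numpy as np
x = np.array([1, 2, 3, 4, 5], dtype=float)   # made-up predictor values
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])      # made-up response values
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
print(beta0, beta1)                          # estimated intercept and slope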
Q8: Explain Linear Regression Model with Normally Distributed Errors. In linear
regression, errors (residuals) are assumed to be normally distributed with mean zero and
constant variance. This assumption ensures valid hypothesis testing and confidence intervals.
Q9: What is Logistic regression? Explain in detail. Also explain any two of its
applications. Logistic regression is used for binary classification problems. It models the
probability that a given input belongs to a particular category using the logistic function.
Applications:
1. Predicting disease presence (yes/no)
2. Customer churn prediction
Q10: Describe logistic regression model with respect to logistic function. The logistic
regression model uses the sigmoid function: P(y=1|x) = 1 / (1 + e^-(β0 + β1x)) It outputs
probabilities between 0 and 1, suitable for classification tasks.
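A minimal sketch of the logistic function and a fitted model in Python with scikit-learn (the data is made up):
import numpy as np
from sklearn.linear_model import LogisticRegression
def sigmoid(z):
    # P(y=1|x) = 1 / (1 + e^-z), where z = β0 + β1x
    return 1.0 / (1.0 + np.exp(-z))
X = np.array([[1], [2], [3], [4], [5], [6]])   # made-up feature values
y = np.array([0, 0, 0, 1, 1, 1])               # made-up binary labels
model = LogisticRegression().fit(X, y)
print(model.predict_proba([[3.5]]))            # class probabilities for a new observation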
Q11: Where decision tree is used? Decision trees are used for classification and regression tasks. A tree is built as follows:
1. Select the best attribute using a selection criterion (e.g., Information Gain)
2. Split the dataset into subsets
3. Repeat the process for each subset recursively
4. Stop when a stopping condition is met (e.g., pure leaf)
Q13: Write down the ID3 Algorithm. ID3 (Iterative Dichotomiser 3):
Q14: Explain the Bayes' Theorem. Bayes' Theorem: P(A|B) = [P(B|A) * P(A)] / P(B) It
helps in updating the probability of a hypothesis given new evidence.
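A small worked example in Python (all probabilities below are made up for illustration):
p_prior = 0.01          # P(A): prior probability of the hypothesis
p_b_given_a = 0.90      # P(B|A): probability of the evidence if the hypothesis holds
p_b_given_not_a = 0.05  # P(B|not A): probability of the evidence otherwise
p_b = p_b_given_a * p_prior + p_b_given_not_a * (1 - p_prior)   # total probability P(B)
p_posterior = (p_b_given_a * p_prior) / p_b                     # Bayes' Theorem
print(round(p_posterior, 3))   # ≈ 0.154: the evidence raises the probability from 1% to about 15%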
Logistic Regression
Decision Trees
Naive Bayes
KNN
SVM
Random Forest
Q17: What is sentiment analysis? How it can be carried out? Explain it in detail.
Sentiment analysis identifies and categorizes opinions expressed in text. Steps:
Text preprocessing
Feature extraction (TF-IDF, embeddings)
Classification (positive/negative/neutral)
Tools: NLP libraries, ML models, lexicon-based methods
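A minimal sketch of this pipeline in Python with scikit-learn (the tiny labelled corpus is made up):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
texts = ["great phone, love it", "terrible battery", "works fine", "worst purchase ever"]
labels = ["positive", "negative", "positive", "negative"]   # made-up training labels
clf = make_pipeline(TfidfVectorizer(lowercase=True), LogisticRegression())
clf.fit(texts, labels)                      # preprocessing + feature extraction + classification
print(clf.predict(["the battery is great"]))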
1. Text preprocessing
2. Feature extraction
3. Apply clustering or topic modeling (e.g., LDA)
4. Interpret and label topics
Q20: Write a short note on decision tree. A decision tree is a flowchart-like structure used
for decision-making. It splits data based on features to arrive at decisions or classifications.
Q21: How to predict whether customers will buy a product or not? Explain with respect
to decision tree. Use customer data as input features. Train a decision tree on past data to
classify outcomes (buy/not buy). The tree will identify patterns and rules.
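A minimal sketch in Python with scikit-learn (the customer features and outcomes below are made up, not real data):
from sklearn.tree import DecisionTreeClassifier
# made-up features: [age, annual income (thousands), visited site last week (0/1)]
X = [[25, 40, 1], [45, 90, 0], [35, 60, 1], [50, 120, 1], [23, 30, 0]]
y = [1, 0, 1, 1, 0]                            # made-up outcomes: 1 = bought, 0 = did not buy
tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(tree.predict([[30, 55, 1]]))             # predicted buy/not buy for a new customer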
Q23: Describe additional classification methods other than decision tree and Bayes’
theorem.
Q24: How to model a structure of observations taken over time? Explain with respect to
Time series analysis. Also explain any two of its applications. Time series analysis models
sequential data points. Applications:
Q25: What are the components of time series? Explain each of them. Also write the
main steps of Box-Jenkins methodology for time series analysis. Components:
Box-Jenkins Steps:
1. Model identification
2. Parameter estimation
3. Model checking
Q27: Explain additional time series methods other than Box-Jenkins methodology and
Autoregressive Integrated Moving Average Model.
Exponential Smoothing
Seasonal ARIMA (SARIMA)
Prophet Model
Long Short-Term Memory (LSTM) networks
Q28: What are major challenges with text analysis? Explain with examples.
Ambiguity of language
Sarcasm and irony
Misspellings and slang
Multilingual content
Example: "This phone is sick" could be positive or negative depending on context.
1. Text preprocessing
2. Tokenization
3. Feature extraction
4. Classification or clustering
5. Visualization and interpretation
1. Data acquisition
2. Cleaning and transformation
3. Sentiment scoring
4. Classification
5. Decision-making
Q31: What is the use of Regular Expressions? Explain any five regular expressions with
its description and example. Regular expressions are patterns used for string matching.
Examples:
1. \d+ – Matches one or more digits
2. \w+ – Matches one or more word characters (letters, digits, underscore)
3. ^ – Anchors the match to the start of the string
4. $ – Anchors the match to the end of the string
5. [a-zA-Z] – Matches a single alphabetic character
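A minimal sketch of these patterns with Python's re module (the sample string is made up):
import re
text = "Order 42 shipped to Pune on 2024-05-01"
print(re.findall(r"\d+", text))         # ['42', '2024', '05', '01'] – runs of digits
print(re.findall(r"\w+", text))         # runs of word characters
print(bool(re.match(r"^Order", text)))  # True – ^ anchors the match to the start
print(bool(re.search(r"01$", text)))    # True – $ anchors the match to the end
print(re.findall(r"[a-zA-Z]+", text))   # runs of alphabetic characters only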
Q32: How to normalize the text using tokenization and case folding? Explain in detail.
Also explain about Bag-of-words approach.
Q33: How to retrieve information and applying text analysis? Explain with respect to
Term Frequency. Term Frequency (TF) counts how often a term appears in a document,
helping identify key terms. High TF indicates relevance within a document.
Q34: What is the critical problem in using Term frequency? How can it be fixed?
Problem: Common terms (e.g., "the", "and") dominate the scores even though they carry little meaning. Solution: Use TF-IDF, which reduces the weight of terms that appear in many documents.
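A minimal sketch of TF-IDF in Python with scikit-learn (the documents are made up):
from sklearn.feature_extraction.text import TfidfVectorizer
docs = ["big data needs big clusters",
        "data science uses data and models",
        "clusters store data"]           # made-up documents
vec = TfidfVectorizer()
tfidf = vec.fit_transform(docs)          # term frequency scaled by inverse document frequency
print(vec.get_feature_names_out())       # vocabulary
print(tfidf.toarray().round(2))          # frequent words such as "data" receive lower weights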
Q35: How to categorize documents by topics? Explain in detail. Use topic modeling
methods like LDA:
Preprocess text
Extract features
Fit LDA model
Interpret topics from word distributions
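A minimal sketch of LDA-based topic categorization in Python with scikit-learn (the corpus is made up):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
docs = ["stocks fell as markets reacted", "the team won the final match",
        "bond yields rose sharply", "players trained before the game"]   # made-up corpus
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)                                   # preprocess and extract features
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)   # fit the LDA model
terms = vec.get_feature_names_out()
for topic in lda.components_:                                 # interpret topics from word weights
    print([terms[i] for i in topic.argsort()[-3:]])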
Q37: Explain the ARIMA Model technique ARIMA (Auto-Regressive Integrated Moving
Average) captures autocorrelations in time series data. It combines differencing (I), past
values (AR), and past errors (MA).
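A minimal sketch in Python using statsmodels (the series values and the (p, d, q) order are made-up examples):
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
series = pd.Series([112, 118, 132, 129, 121, 135, 148, 148, 136, 119])   # made-up observations
model = ARIMA(series, order=(1, 1, 1))    # AR order p=1, differencing d=1, MA order q=1
fit = model.fit()
print(fit.forecast(steps=3))              # forecast the next three points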
Q38: Explain the steps involved in Text analysis with example.
Q39: What is tokenization? Explain how it is used in text analysis. Tokenization splits
text into meaningful units (tokens). It's the first step in NLP, allowing further analysis like
sentiment detection or classification.
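A minimal sketch of tokenization with case folding, plus a simple bag-of-words count, in Python (the sentence is made up):
import re
from collections import Counter
text = "Tokenization splits TEXT into tokens; case folding lowercases them."
tokens = re.findall(r"[a-z0-9]+", text.lower())   # case folding + tokenization
bag_of_words = Counter(tokens)                    # bag-of-words: token -> count
print(tokens)
print(bag_of_words)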
UNIT III
Q1: Explain the concept of Data Product. A data product is a product that facilitates end
goals through the use of data. It leverages data analysis, statistical modeling, and machine
learning to provide insights, recommendations, or automation. Examples include
recommendation systems, fraud detection models, and predictive maintenance systems.
Q2: How can Hadoop be used to build Data Products at scale? Hadoop enables scalable
storage and processing of big data. It supports building data products by:
Q3: Write a short note on: The Data Science Pipeline The Data Science Pipeline refers to
the sequence of steps followed to extract insights from data. Steps include:
Data collection
Data cleaning
Data exploration
Feature engineering
Model training
Model evaluation
Deployment
Q4: Write a short note on: The Big Data Pipeline The Big Data Pipeline processes large-
scale data through stages such as:
Q5: Write a short note on: Hadoop – The Big Data Operating System Hadoop is
considered the OS of Big Data due to its ability to manage distributed storage and processing
of massive datasets across clusters of computers using simple programming models.
Q8: Explain the concept of Hadoop Cluster A Hadoop cluster is a collection of machines
(nodes) configured to work together to store and process large datasets in a distributed
environment.
Q9: Explain Hadoop Distributed File System (HDFS) HDFS is a distributed file system
designed to store large data sets reliably across many machines. It has a master-slave
architecture with NameNode and DataNodes.
Q12: Explain the concept of Hadoop Streaming Hadoop Streaming allows developers to
use any programming language to write Map and Reduce functions by reading from standard
input and writing to standard output.
Q14: Explain the Basics of Apache Spark Apache Spark is a fast, general-purpose engine
for large-scale data processing. It supports batch and stream processing and provides APIs in
Python, Java, Scala, and R.
Spark Core
Spark SQL
Spark Streaming
MLlib
GraphX
Q16: Explain the concept of Resilient Distributed Datasets (RDDs) RDDs are immutable
distributed collections of objects. They support parallel operations and fault tolerance via
lineage information.
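A minimal RDD word-count sketch in PySpark (the HDFS paths are assumptions):
from pyspark import SparkContext
sc = SparkContext(appName="wordcount")              # the driver creates the SparkContext
lines = sc.textFile("hdfs:///data/input.txt")       # assumed input path
counts = (lines.flatMap(lambda line: line.split())  # transformations only record lineage
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("hdfs:///data/output")        # the action triggers execution; assumed path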
Q17: Write short note on: A typical Spark Application A Spark application includes:
Driver program
Cluster manager
Executors
Tasks
Q18: Write short note on: The Spark Execution Model Spark executes a DAG (Directed
Acyclic Graph) of stages and tasks. Each task runs on an executor and processes a partition of
the data.
Q19: What is data science pipeline? Explain in detail with a neat diagram. [Diagram not
shown] Steps:
1. Data Collection
2. Data Preparation
3. Modeling
4. Evaluation
5. Deployment
6. Monitoring
Q20: How to refactor the data science pipeline into an iterative model? Explain all its
phases with a neat diagram. Refactoring involves incorporating feedback loops and
iterations across all phases, allowing continuous improvement and adaptation to changes.
Scalability
Fault tolerance
High availability
Data locality
Load balancing
Security
Horizontal scalability
HDFS for fault tolerance
YARN for resource management
Data locality through task scheduling
Security through Kerberos
Q24: Explain with a neat diagram a small Hadoop cluster with two master nodes and four worker nodes that implements all six primary Hadoop services. [Diagram not shown] The services include:
Q25: Write a short note on Hadoop Distributed File System. [Duplicate of Q9]
Q26: How basic interaction can be done in Hadoop distributed file system? Explain any
five basic file system operations with its appropriate command.
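A sketch of five basic HDFS shell operations (the paths and file names are made-up examples):
hdfs dfs -ls /user/data              # list the contents of a directory
hdfs dfs -mkdir /user/data/new       # create a new directory
hdfs dfs -put local.csv /user/data   # copy a local file into HDFS
hdfs dfs -cat /user/data/local.csv   # print the contents of a file
hdfs dfs -rm /user/data/local.csv    # delete a file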
Q27: What are various types of permissions in HDFS? What are different access levels? Write and explain commands to set various types and access levels. What is a caveat with file permissions on HDFS? Permissions: Read (r), Write (w), Execute (x). Access levels: User (owner), Group, Others. Command: hdfs dfs -chmod 755 /file gives the owner full access and the group and others read/execute access. Caveat: HDFS permissions are essentially advisory unless strong authentication (e.g., Kerberos) is enabled, because HDFS otherwise trusts the identity reported by the client.
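For illustration, the permission-related commands look like this (the path, user, and group names are made up):
hdfs dfs -chmod 755 /user/data/report.csv                 # owner rwx, group r-x, others r-x
hdfs dfs -chown analyst:analytics /user/data/report.csv   # change owner and group
hdfs dfs -chgrp analytics /user/data/report.csv           # change group only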
Q29: How MapReduce can be implemented on a Cluster? Explain all phases with a neat
diagram. Phases:
Input Splitting
Mapping
Shuffling & Sorting
Reducing
Output writing
Q30: Explain the details of data flow in a MapReduce pipeline executed on a cluster of a
few nodes with a neat diagram. Data flows from HDFS -> Mapper -> Combiner ->
Partitioner -> Reducer -> Output
Q31: Write a short note on Job Chaining. Job chaining allows multiple MapReduce jobs to
be connected, where the output of one job is the input of the next.
Q33: Demonstrate the process of Computing on CSV Data with Hadoop Streaming. Write a mapper and a reducer script (e.g., in Python or Perl) that read CSV records from standard input and emit tab-separated key/value pairs on standard output.
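A minimal sketch of such scripts in Python (the CSV layout, with the key in the first column and the value in the third, is an assumed example):
# mapper.py – reads CSV lines from standard input, emits key<TAB>value
import sys, csv
for row in csv.reader(sys.stdin):
    if len(row) >= 3:
        print("%s\t%s" % (row[0], row[2]))
# reducer.py – sums values per key from the sorted mapper output
import sys
current, total = None, 0.0
for line in sys.stdin:
    key, value = line.strip().split("\t")
    if key != current:
        if current is not None:
            print("%s\t%s" % (current, total))
        current, total = key, 0.0
    total += float(value)
if current is not None:
    print("%s\t%s" % (current, total))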
Q34: Demonstrate the process of executing a Streaming job on a Hadoop cluster. Submit the job with the Hadoop Streaming JAR, specifying the mapper and reducer scripts along with the input and output paths.
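A typical submission looks like this (the jar location and HDFS paths vary by installation and are assumptions here):
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -files mapper.py,reducer.py \
    -mapper mapper.py \
    -reducer reducer.py \
    -input /user/data/input.csv \
    -output /user/data/output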
Q35: Write a short note on Combiners in advanced MapReduce context. Combiners act
as mini-reducers to decrease the volume of data transferred to reducers.
Q37: Write a short note on Job Chaining in advanced MapReduce context. Job chaining
links multiple jobs; output of one job is used as input for the next job.
Q38: Write in brief about Spark. Also write and explain its primary components. Spark
is a distributed computing engine with components:
Spark Core – the base execution engine providing RDDs, task scheduling, and memory management
Spark SQL – structured data processing with DataFrames and SQL queries
Spark Streaming – processing of live data streams in small batches
MLlib – a library of scalable machine learning algorithms
GraphX – an API for graphs and graph-parallel computation
UNIT IV
Q1: What are the Distributed Analysis and Patterns? Distributed analysis refers to the
processing and analysis of data spread across multiple nodes in a distributed computing
environment. Common patterns include partitioning, replication, and parallel processing.
Q2: What is Computing with Keys? Computing with keys involves using keys to group and
partition data, enabling parallel computations by distributing the data across multiple
reducers.
Q3: Write a short note on Compound Keys. Compound keys are composed of multiple
fields and are used to uniquely identify records and allow composite sorting and grouping in
distributed systems.
Q5: What is the Identity Pattern? The identity pattern is a MapReduce pattern where the
map or reduce function simply passes data unchanged. Useful for debugging or data
inspection.
Q6: Write a short note on Pairs versus Stripes. Pairs and stripes are two techniques for
computing co-occurrence matrices. Pairs emit a pair of items per record; stripes emit a single
key with a map of co-occurring items.
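A minimal sketch contrasting the two emission styles in Python (the record is a made-up basket of co-occurring items):
from collections import defaultdict
record = ["bread", "milk", "butter"]
# Pairs: emit one ((a, b), 1) per co-occurring pair
pairs = [((a, b), 1) for a in record for b in record if a != b]
# Stripes: emit one (a, {b: count}) map per item
stripes = {}
for a in record:
    stripe = defaultdict(int)
    for b in record:
        if a != b:
            stripe[b] += 1
    stripes[a] = dict(stripe)
print(pairs)
print(stripes)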
Q7: Explain different Design Patterns. Design patterns include Summarization, Filtering,
Data Organization, and Meta Patterns. Each addresses a common problem in distributed
computing.
Q9: Explain how keys allow parallel reduction by partitioning the keyspace to multiple
reducers. By assigning a range of keys to different reducers, MapReduce can distribute work
efficiently, allowing for parallel reduction and scaling of large datasets.
Q10: What is the functionality of the explode mapper? Explode mapper transforms a
single input record into multiple output records. Example: A list of tags per post can be
exploded into individual tag-post pairs.
Q11: What is the functionality of the filter mapper? Filter mapper removes records that do
not meet specific criteria. Example: Removing records with null or invalid values.
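A minimal sketch of the explode and filter mappers as Python generator functions (the record layout with a 'tags' list is a made-up example):
def explode_mapper(record):
    # one input record -> many output records (post, tag)
    for tag in record["tags"]:
        yield (record["post_id"], tag)
def filter_mapper(record):
    # drop records that do not meet the criterion (here: a missing value)
    if record.get("value") is not None:
        yield record
post = {"post_id": 7, "tags": ["hadoop", "spark"], "value": 3}
print(list(explode_mapper(post)))   # [(7, 'hadoop'), (7, 'spark')]
print(list(filter_mapper(post)))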
Q12: What is the functionality of the identity pattern? The identity pattern returns input
data as output without transformation. It’s used for testing and simple data pipeline stages.
Q13: Write briefly about design patterns. Explain each of its category.
Summarization Patterns
Filtering Patterns
Data Organization Patterns
Meta Patterns
Each category focuses on structuring data or operations to address a common analysis problem in distributed environments.
Q14: Data flow for predicting number of comments from news/blog data.
Q17: Drawback of conventional relational approach: Relational databases do not scale well with unstructured or very large datasets. This is addressed by NoSQL stores, Hadoop, and other distributed processing systems.
ColumnFilter
RowFilter
ValueFilter
Commands involve scan with filter parameters.
Q23: Flume Data Flow Diagram: Source → Channel → Sink (e.g., HDFS), all running inside a Flume Agent.
Q24: Single-agent Flume data flow: a source tails the Apache logs → events pass through a channel → an HDFS sink writes them to HDFS. The agent is configured via flume.conf.
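A sketch of what such a flume.conf could look like (the agent name, log path, and HDFS path are assumptions):
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/apache2/access.log
a1.sources.r1.channels = c1
a1.channels.c1.type = memory
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/apache_logs
a1.sinks.k1.channel = c1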
D = GROUP A BY id;
E = JOIN A BY id, B BY id;
Q29: Pig Relational Operators: LOAD, STORE, FILTER, FOREACH, GROUP, JOIN,
ORDER, DISTINCT, UNION, SPLIT
DataFrames
SQL engine
Catalyst optimizer
Tungsten execution engine
Diagram: SQL/DF → Catalyst → Logical/Physical Plan → Tungsten → Execution
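A minimal PySpark sketch of this flow (the column names and rows are made up; Catalyst and Tungsten work behind the API):
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("sql-demo").getOrCreate()
df = spark.createDataFrame([(1, "alice", 34), (2, "bob", 41)], ["id", "name", "age"])
df.createOrReplaceTempView("people")
result = spark.sql("SELECT name FROM people WHERE age > 35")   # parsed and optimized by Catalyst
result.show()                                                  # executed via the Tungsten engine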