BDA Answers
Q1: Define Big Data. Why is Big Data required? How does traditional BI environment
differ from Big Data environment?
Big Data refers to extremely large datasets that may be analyzed computationally to reveal
patterns, trends, and associations, especially relating to human behavior and interactions. Big
Data is defined by the 3Vs: Volume, Velocity, and Variety.
Q3: Define Big Data. Why is Big Data required? Write a note on data warehouse
environment.
Q4: What are the three characteristics of Big Data? Explain the differences between BI
and Data Science.
BI vs Data Science:
Q6: What are the key roles for the New Big Data Ecosystem?
Q8: What is Big Data Analytics? Explain in detail with its example. Big Data Analytics
refers to the process of examining large datasets to uncover hidden patterns, correlations,
market trends, and customer preferences. Example: Netflix uses Big Data to analyze user
preferences and recommend content accordingly.
Q11: Write a short note on data science and data science process. Data Science:
Interdisciplinary field that uses algorithms, data analysis, and ML to extract insights.
Process:
1. Business understanding
2. Data collection
3. Data preparation
4. Modeling
5. Evaluation
6. Deployment
7. Monitoring
Soft State: State of the system may change over time without input.
Eventual Consistency: In distributed systems, data will become consistent over time.
Example: In Amazon DynamoDB, updates propagate eventually to ensure all replicas
converge.
Q13: What are different phases of the Data Analytics Lifecycle? Explain each in detail.
UNIT II
Q1: Explain Analytical Theory and Methods Analytical theory and methods refer to
mathematical and statistical techniques used to extract insights from data. They provide the
foundation for algorithms in machine learning and predictive modeling. Examples include
regression, classification, clustering, and time series analysis.
Steps:
1. Select K initial centroids randomly
2. Assign each data point to the nearest centroid
3. Recalculate centroids as the mean of points in each cluster
4. Repeat steps 2 and 3 until convergence
Applications: Customer segmentation, image compression, pattern recognition
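A minimal sketch of these steps in Python using scikit-learn (the 2-D points below are made-up example data):
# K-means on a small, made-up 2-D dataset
import numpy as np
from sklearn.cluster import KMeans
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)   # choose K = 2 centroids
labels = kmeans.fit_predict(X)       # assign points and recompute centroids until convergence
print(labels)                        # cluster index of each point
print(kmeans.cluster_centers_)       # final centroids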
Q3: Write a short note on Diagnostics Diagnostics are tools and techniques used to
evaluate model performance. They help identify model assumptions, residual errors,
multicollinearity, and overfitting. Examples include confusion matrix, ROC curve, and
residual plots.
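A minimal sketch of two of these diagnostics in Python with scikit-learn (the labels and scores below are made up):
from sklearn.metrics import confusion_matrix, roc_auc_score
y_true = [0, 0, 1, 1, 1, 0]                 # made-up actual labels
y_pred = [0, 1, 1, 1, 0, 0]                 # made-up predicted labels
y_score = [0.2, 0.6, 0.8, 0.9, 0.4, 0.1]    # made-up predicted probabilities
print(confusion_matrix(y_true, y_pred))     # counts of correct and incorrect predictions
print(roc_auc_score(y_true, y_score))       # area under the ROC curve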
Q4: Explain Units of Measure Units of measure indicate the scale or quantity in which data
is represented (e.g., dollars, seconds, kilograms). Ensuring consistency in units is critical for
accurate analysis and comparisons.
Data quality
Feature selection
Handling missing values
Model interpretability
Ethical and legal concerns
Q6: What are the Additional Algorithms Apart from basic methods, additional algorithms
include:
Random Forest
Gradient Boosting
Support Vector Machines (SVM)
Neural Networks
K-Nearest Neighbors (KNN)
Q7: Write a short note on linear regression model. Also apply Ordinary Least Squares
(OLS) technique to estimate the parameters. Linear regression models the relationship
between a dependent variable and one or more independent variables using a straight line.
Model: y = β0 + β1x + ε
OLS estimates β0 and β1 by minimizing the sum of squared errors (SSE):
β1 = Σ((xi - x̄)(yi - ȳ)) / Σ(xi - x̄)²
β0 = ȳ - β1x̄
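A minimal sketch of these formulas in Python with NumPy (x and y are made-up sample data):
import numpy as np
x = np.array([1, 2, 3, 4, 5], dtype=float)   # made-up predictor values
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])      # made-up response values
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
print(beta0, beta1)                          # estimated intercept and slope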
Q8: Explain Linear Regression Model with Normally Distributed Errors. In linear
regression, errors (residuals) are assumed to be normally distributed with mean zero and
constant variance. This assumption ensures valid hypothesis testing and confidence intervals.
Q9: What is Logistic regression? Explain in detail. Also explain any two of its
applications. Logistic regression is used for binary classification problems. It models the
probability that a given input belongs to a particular category using the logistic function.
Applications:
1. Predicting disease presence (yes/no)
2. Customer churn prediction
Q10: Describe logistic regression model with respect to logistic function. The logistic
regression model uses the sigmoid function: P(y=1|x) = 1 / (1 + e^-(β0 + β1x)) It outputs
probabilities between 0 and 1, suitable for classification tasks.
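A minimal sketch of the logistic function and a fitted model in Python with scikit-learn (the data is made up):
import numpy as np
from sklearn.linear_model import LogisticRegression
def sigmoid(z):
    # P(y=1|x) = 1 / (1 + e^-z), where z = β0 + β1x
    return 1.0 / (1.0 + np.exp(-z))
X = np.array([[1], [2], [3], [4], [5], [6]])   # made-up feature values
y = np.array([0, 0, 0, 1, 1, 1])               # made-up binary labels
model = LogisticRegression().fit(X, y)
print(model.predict_proba([[3.5]]))            # class probabilities for a new observation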
Q11: Where decision tree is used? Decision trees are used for classification and regression tasks. A tree is built as follows:
1. Select the best attribute using a selection criterion (e.g., Information Gain)
2. Split the dataset into subsets
3. Repeat the process for each subset recursively
4. Stop when a stopping condition is met (e.g., pure leaf)
Q13: Write down the ID3 Algorithm. ID3 (Iterative Dichotomiser 3):
Q14: Explain the Bayes' Theorem. Bayes' Theorem: P(A|B) = [P(B|A) * P(A)] / P(B) It
helps in updating the probability of a hypothesis given new evidence.
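A small worked example in Python (all probabilities below are made up for illustration):
p_prior = 0.01          # P(A): prior probability of the hypothesis
p_b_given_a = 0.90      # P(B|A): probability of the evidence if the hypothesis holds
p_b_given_not_a = 0.05  # P(B|not A): probability of the evidence otherwise
p_b = p_b_given_a * p_prior + p_b_given_not_a * (1 - p_prior)   # total probability P(B)
p_posterior = (p_b_given_a * p_prior) / p_b                     # Bayes' Theorem
print(round(p_posterior, 3))   # ≈ 0.154: the evidence raises the probability from 1% to about 15%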
Logistic Regression
Decision Trees
Naive Bayes
KNN
SVM
Random Forest
Q17: What is sentiment analysis? How it can be carried out? Explain it in detail.
Sentiment analysis identifies and categorizes opinions expressed in text. Steps:
Text preprocessing
Feature extraction (TF-IDF, embeddings)
Classification (positive/negative/neutral)
Tools: NLP libraries, ML models, lexicon-based methods
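A minimal sketch of this pipeline in Python with scikit-learn (the tiny labelled corpus is made up):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
texts = ["great phone, love it", "terrible battery", "works fine", "worst purchase ever"]
labels = ["positive", "negative", "positive", "negative"]   # made-up training labels
clf = make_pipeline(TfidfVectorizer(lowercase=True), LogisticRegression())
clf.fit(texts, labels)                      # preprocessing + feature extraction + classification
print(clf.predict(["the battery is great"]))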
1. Text preprocessing
2. Feature extraction
3. Apply clustering or topic modeling (e.g., LDA)
4. Interpret and label topics
Q20: Write a short note on decision tree. A decision tree is a flowchart-like structure used
for decision-making. It splits data based on features to arrive at decisions or classifications.
Q21: How to predict whether customers will buy a product or not? Explain with respect
to decision tree. Use customer data as input features. Train a decision tree on past data to
classify outcomes (buy/not buy). The tree will identify patterns and rules.
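A minimal sketch in Python with scikit-learn (the customer features and outcomes below are made up, not real data):
from sklearn.tree import DecisionTreeClassifier
# made-up features: [age, annual income (thousands), visited site last week (0/1)]
X = [[25, 40, 1], [45, 90, 0], [35, 60, 1], [50, 120, 1], [23, 30, 0]]
y = [1, 0, 1, 1, 0]                            # made-up outcomes: 1 = bought, 0 = did not buy
tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(tree.predict([[30, 55, 1]]))             # predicted buy/not buy for a new customer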
Q23: Describe additional classification methods other than decision tree and Bayes’
theorem.
Q24: How to model a structure of observations taken over time? Explain with respect to
Time series analysis. Also explain any two of its applications. Time series analysis models
sequential data points. Applications:
Q25: What are the components of time series? Explain each of them. Also write the
main steps of Box-Jenkins methodology for time series analysis. Components:
Box-Jenkins Steps:
1. Model identification
2. Parameter estimation
3. Model checking
Q27: Explain additional time series methods other than Box-Jenkins methodology and
Autoregressive Integrated Moving Average Model.
Exponential Smoothing
Seasonal ARIMA (SARIMA)
Prophet Model
Long Short-Term Memory (LSTM) networks
Q28: What are major challenges with text analysis? Explain with examples.
Ambiguity of language
Sarcasm and irony
Misspellings and slang
Multilingual content
Example: "This phone is sick" could be positive or negative depending on context.
1. Text preprocessing
2. Tokenization
3. Feature extraction
4. Classification or clustering
5. Visualization and interpretation
1. Data acquisition
2. Cleaning and transformation
3. Sentiment scoring
4. Classification
5. Decision-making
Q31: What is the use of Regular Expressions? Explain any five regular expressions with
its description and example. Regular expressions are patterns used for string matching.
Examples:
1. \d+ – Matches one or more digits
2. \w+ – Matches one or more word characters (letters, digits, underscore)
3. ^ – Anchors the match to the start of the string
4. $ – Anchors the match to the end of the string
5. [a-zA-Z] – Matches a single alphabetic character
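A minimal sketch of these patterns with Python's re module (the sample string is made up):
import re
text = "Order 42 shipped to Pune on 2024-05-01"
print(re.findall(r"\d+", text))         # ['42', '2024', '05', '01'] – runs of digits
print(re.findall(r"\w+", text))         # runs of word characters
print(bool(re.match(r"^Order", text)))  # True – ^ anchors the match to the start
print(bool(re.search(r"01$", text)))    # True – $ anchors the match to the end
print(re.findall(r"[a-zA-Z]+", text))   # runs of alphabetic characters only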
Q32: How to normalize the text using tokenization and case folding? Explain in detail.
Also explain about Bag-of-words approach.
Q33: How to retrieve information and applying text analysis? Explain with respect to
Term Frequency. Term Frequency (TF) counts how often a term appears in a document,
helping identify key terms. High TF indicates relevance within a document.
Q34: What is the critical problem in using Term frequency? How can it be fixed?
Problem: Common terms (e.g., "the", "and") dominate the scores even though they carry little meaning. Solution: Use TF-IDF, which reduces the weight of terms that appear in many documents.
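A minimal sketch of TF-IDF in Python with scikit-learn (the documents are made up):
from sklearn.feature_extraction.text import TfidfVectorizer
docs = ["big data needs big clusters",
        "data science uses data and models",
        "clusters store data"]           # made-up documents
vec = TfidfVectorizer()
tfidf = vec.fit_transform(docs)          # term frequency scaled by inverse document frequency
print(vec.get_feature_names_out())       # vocabulary
print(tfidf.toarray().round(2))          # frequent words such as "data" receive lower weights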
Q35: How to categorize documents by topics? Explain in detail. Use topic modeling
methods like LDA:
Preprocess text
Extract features
Fit LDA model
Interpret topics from word distributions
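A minimal sketch of LDA-based topic categorization in Python with scikit-learn (the corpus is made up):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
docs = ["stocks fell as markets reacted", "the team won the final match",
        "bond yields rose sharply", "players trained before the game"]   # made-up corpus
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)                                   # preprocess and extract features
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)   # fit the LDA model
terms = vec.get_feature_names_out()
for topic in lda.components_:                                 # interpret topics from word weights
    print([terms[i] for i in topic.argsort()[-3:]])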
Q37: Explain the ARIMA Model technique ARIMA (Auto-Regressive Integrated Moving
Average) captures autocorrelations in time series data. It combines differencing (I), past
values (AR), and past errors (MA).
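A minimal sketch in Python using statsmodels (the series values and the (p, d, q) order are made-up examples):
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
series = pd.Series([112, 118, 132, 129, 121, 135, 148, 148, 136, 119])   # made-up observations
model = ARIMA(series, order=(1, 1, 1))    # AR order p=1, differencing d=1, MA order q=1
fit = model.fit()
print(fit.forecast(steps=3))              # forecast the next three points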
Q38: Explain the steps involved in Text analysis with example.
Q39: What is tokenization? Explain how it is used in text analysis. Tokenization splits
text into meaningful units (tokens). It's the first step in NLP, allowing further analysis like
sentiment detection or classification.
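A minimal sketch of tokenization with case folding, plus a simple bag-of-words count, in Python (the sentence is made up):
import re
from collections import Counter
text = "Tokenization splits TEXT into tokens; case folding lowercases them."
tokens = re.findall(r"[a-z0-9]+", text.lower())   # case folding + tokenization
bag_of_words = Counter(tokens)                    # bag-of-words: token -> count
print(tokens)
print(bag_of_words)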
UNIT III
Q1: Explain the concept of Data Product. A data product is a product that facilitates end
goals through the use of data. It leverages data analysis, statistical modeling, and machine
learning to provide insights, recommendations, or automation. Examples include
recommendation systems, fraud detection models, and predictive maintenance systems.
Q2: How can Hadoop be used to build Data Products at scale? Hadoop enables scalable
storage and processing of big data. It supports building data products by:
Q3: Write a short note on: The Data Science Pipeline The Data Science Pipeline refers to
the sequence of steps followed to extract insights from data. Steps include:
Data collection
Data cleaning
Data exploration
Feature engineering
Model training
Model evaluation
Deployment
Q4: Write a short note on: The Big Data Pipeline The Big Data Pipeline processes large-
scale data through stages such as:
Q5: Write a short note on: Hadoop – The Big Data Operating System Hadoop is
considered the OS of Big Data due to its ability to manage distributed storage and processing
of massive datasets across clusters of computers using simple programming models.
Q8: Explain the concept of Hadoop Cluster A Hadoop cluster is a collection of machines
(nodes) configured to work together to store and process large datasets in a distributed
environment.
Q9: Explain Hadoop Distributed File System (HDFS) HDFS is a distributed file system
designed to store large data sets reliably across many machines. It has a master-slave
architecture with NameNode and DataNodes.
Q12: Explain the concept of Hadoop Streaming Hadoop Streaming allows developers to
use any programming language to write Map and Reduce functions by reading from standard
input and writing to standard output.
Q14: Explain the Basics of Apache Spark Apache Spark is a fast, general-purpose engine
for large-scale data processing. It supports batch and stream processing and provides APIs in
Python, Java, Scala, and R.
Spark Core
Spark SQL
Spark Streaming
MLlib
GraphX
Q16: Explain the concept of Resilient Distributed Datasets (RDDs) RDDs are immutable
distributed collections of objects. They support parallel operations and fault tolerance via
lineage information.
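A minimal RDD word-count sketch in PySpark (the HDFS paths are assumptions):
from pyspark import SparkContext
sc = SparkContext(appName="wordcount")              # the driver creates the SparkContext
lines = sc.textFile("hdfs:///data/input.txt")       # assumed input path
counts = (lines.flatMap(lambda line: line.split())  # transformations only record lineage
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("hdfs:///data/output")        # the action triggers execution; assumed path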
Q17: Write short note on: A typical Spark Application A Spark application includes:
Driver program
Cluster manager
Executors
Tasks
Q18: Write short note on: The Spark Execution Model Spark executes a DAG (Directed
Acyclic Graph) of stages and tasks. Each task runs on an executor and processes a partition of
the data.
Q19: What is data science pipeline? Explain in detail with a neat diagram. [Diagram not
shown] Steps:
1. Data Collection
2. Data Preparation
3. Modeling
4. Evaluation
5. Deployment
6. Monitoring
Q20: How to refactor the data science pipeline into an iterative model? Explain all its
phases with a neat diagram. Refactoring involves incorporating feedback loops and
iterations across all phases, allowing continuous improvement and adaptation to changes.
Scalability
Fault tolerance
High availability
Data locality
Load balancing
Security
Horizontal scalability
HDFS for fault tolerance
YARN for resource management
Data locality through task scheduling
Security through Kerberos
Q24: Explain with a neat diagram a small Hadoop cluster with two master nodes and four worker nodes that implements all six primary Hadoop services. [Diagram not shown] The services include:
Q25: Write a short note on Hadoop Distributed File System. [Duplicate of Q9]
Q26: How basic interaction can be done in Hadoop distributed file system? Explain any
five basic file system operations with its appropriate command.
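A sketch of five basic HDFS shell operations (the paths and file names are made-up examples):
hdfs dfs -ls /user/data              # list the contents of a directory
hdfs dfs -mkdir /user/data/new       # create a new directory
hdfs dfs -put local.csv /user/data   # copy a local file into HDFS
hdfs dfs -cat /user/data/local.csv   # print the contents of a file
hdfs dfs -rm /user/data/local.csv    # delete a file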
Q27: What are various types of permissions in HDFS? What are different access levels? Write and explain commands to set various types and access levels. What is a caveat with file permissions on HDFS? Permissions: Read (r), Write (w), Execute (x). Access levels: User (owner), Group, Others. Command: hdfs dfs -chmod 755 /file gives the owner full access and the group and others read/execute access. Caveat: HDFS permissions are essentially advisory unless strong authentication (e.g., Kerberos) is enabled, because HDFS otherwise trusts the identity reported by the client.
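For illustration, the permission-related commands look like this (the path, user, and group names are made up):
hdfs dfs -chmod 755 /user/data/report.csv                 # owner rwx, group r-x, others r-x
hdfs dfs -chown analyst:analytics /user/data/report.csv   # change owner and group
hdfs dfs -chgrp analytics /user/data/report.csv           # change group only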
Q29: How MapReduce can be implemented on a Cluster? Explain all phases with a neat
diagram. Phases:
Input Splitting
Mapping
Shuffling & Sorting
Reducing
Output writing
Q30: Explain the details of data flow in a MapReduce pipeline executed on a cluster of a
few nodes with a neat diagram. Data flows from HDFS -> Mapper -> Combiner ->
Partitioner -> Reducer -> Output
Q31: Write a short note on Job Chaining. Job chaining allows multiple MapReduce jobs to
be connected, where the output of one job is the input of the next.
Q33: Demonstrate the process of Computing on CSV Data with Hadoop Streaming. Write a mapper and a reducer script (e.g., in Python or Perl) that read CSV records from standard input and emit tab-separated key/value pairs on standard output.
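A minimal sketch of such scripts in Python (the CSV layout, with the key in the first column and the value in the third, is an assumed example):
# mapper.py – reads CSV lines from standard input, emits key<TAB>value
import sys, csv
for row in csv.reader(sys.stdin):
    if len(row) >= 3:
        print("%s\t%s" % (row[0], row[2]))
# reducer.py – sums values per key from the sorted mapper output
import sys
current, total = None, 0.0
for line in sys.stdin:
    key, value = line.strip().split("\t")
    if key != current:
        if current is not None:
            print("%s\t%s" % (current, total))
        current, total = key, 0.0
    total += float(value)
if current is not None:
    print("%s\t%s" % (current, total))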
Q34: Demonstrate the process of executing a Streaming job on a Hadoop cluster. Submit the job with the Hadoop Streaming JAR, specifying the mapper and reducer scripts along with the input and output paths.
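A typical submission looks like this (the jar location and HDFS paths vary by installation and are assumptions here):
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -files mapper.py,reducer.py \
    -mapper mapper.py \
    -reducer reducer.py \
    -input /user/data/input.csv \
    -output /user/data/output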
Q35: Write a short note on Combiners in advanced MapReduce context. Combiners act
as mini-reducers to decrease the volume of data transferred to reducers.
Q37: Write a short note on Job Chaining in advanced MapReduce context. Job chaining
links multiple jobs; output of one job is used as input for the next job.
Q38: Write in brief about Spark. Also write and explain its primary components. Spark
is a distributed computing engine with components:
Spark Core – the base execution engine providing RDDs, task scheduling, and memory management
Spark SQL – structured data processing with DataFrames and SQL queries
Spark Streaming – processing of live data streams in small batches
MLlib – a library of scalable machine learning algorithms
GraphX – an API for graphs and graph-parallel computation
UNIT IV
Q1: What are the Distributed Analysis and Patterns? Distributed analysis refers to the
processing and analysis of data spread across multiple nodes in a distributed computing
environment. Common patterns include partitioning, replication, and parallel processing.
Q2: What is Computing with Keys? Computing with keys involves using keys to group and
partition data, enabling parallel computations by distributing the data across multiple
reducers.
Q3: Write a short note on Compound Keys. Compound keys are composed of multiple
fields and are used to uniquely identify records and allow composite sorting and grouping in
distributed systems.
Q5: What is the Identity Pattern? The identity pattern is a MapReduce pattern where the
map or reduce function simply passes data unchanged. Useful for debugging or data
inspection.
Q6: Write a short note on Pairs versus Stripes. Pairs and stripes are two techniques for
computing co-occurrence matrices. Pairs emit a pair of items per record; stripes emit a single
key with a map of co-occurring items.
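A minimal sketch contrasting the two emission styles in Python (the record is a made-up basket of co-occurring items):
from collections import defaultdict
record = ["bread", "milk", "butter"]
# Pairs: emit one ((a, b), 1) per co-occurring pair
pairs = [((a, b), 1) for a in record for b in record if a != b]
# Stripes: emit one (a, {b: count}) map per item
stripes = {}
for a in record:
    stripe = defaultdict(int)
    for b in record:
        if a != b:
            stripe[b] += 1
    stripes[a] = dict(stripe)
print(pairs)
print(stripes)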
Q7: Explain different Design Patterns. Design patterns include Summarization, Filtering,
Data Organization, and Meta Patterns. Each addresses a common problem in distributed
computing.
Q9: Explain how keys allow parallel reduction by partitioning the keyspace to multiple
reducers. By assigning a range of keys to different reducers, MapReduce can distribute work
efficiently, allowing for parallel reduction and scaling of large datasets.
Q10: What is the functionality of the explode mapper? Explode mapper transforms a
single input record into multiple output records. Example: A list of tags per post can be
exploded into individual tag-post pairs.
Q11: What is the functionality of the filter mapper? Filter mapper removes records that do
not meet specific criteria. Example: Removing records with null or invalid values.
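A minimal sketch of the explode and filter mappers as Python generator functions (the record layout with a 'tags' list is a made-up example):
def explode_mapper(record):
    # one input record -> many output records (post, tag)
    for tag in record["tags"]:
        yield (record["post_id"], tag)
def filter_mapper(record):
    # drop records that do not meet the criterion (here: a missing value)
    if record.get("value") is not None:
        yield record
post = {"post_id": 7, "tags": ["hadoop", "spark"], "value": 3}
print(list(explode_mapper(post)))   # [(7, 'hadoop'), (7, 'spark')]
print(list(filter_mapper(post)))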
Q12: What is the functionality of the identity pattern? The identity pattern returns input
data as output without transformation. It’s used for testing and simple data pipeline stages.
Q13: Write briefly about design patterns. Explain each of its category.
Summarization Patterns
Filtering Patterns
Data Organization Patterns
Meta Patterns
Each category focuses on structuring data or operations to address a common analysis problem in distributed environments.
Q14: Data flow for predicting number of comments from news/blog data.
Q17: Drawback of conventional relational approach: Relational databases do not scale well with unstructured or very large datasets. This is addressed by NoSQL stores, Hadoop, and other distributed processing systems.
ColumnFilter
RowFilter
ValueFilter
Commands involve scan with filter parameters.
Q23: Flume Data Flow Diagram: Source → Channel → Sink (e.g., HDFS), all running inside a Flume Agent.
Q24: Single-agent Flume data flow: a source tails the Apache logs → events pass through a channel → an HDFS sink writes them to HDFS. The agent is configured via flume.conf.
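A sketch of what such a flume.conf could look like (the agent name, log path, and HDFS path are assumptions):
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/apache2/access.log
a1.sources.r1.channels = c1
a1.channels.c1.type = memory
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/apache_logs
a1.sinks.k1.channel = c1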
D = GROUP A BY id;
E = JOIN A BY id, B BY id;
Q29: Pig Relational Operators: LOAD, STORE, FILTER, FOREACH, GROUP, JOIN,
ORDER, DISTINCT, UNION, SPLIT
DataFrames
SQL engine
Catalyst optimizer
Tungsten execution engine
Diagram: SQL/DF → Catalyst → Logical/Physical Plan → Tungsten → Execution
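A minimal PySpark sketch of this flow (the column names and rows are made up; Catalyst and Tungsten work behind the API):
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("sql-demo").getOrCreate()
df = spark.createDataFrame([(1, "alice", 34), (2, "bob", 41)], ["id", "name", "age"])
df.createOrReplaceTempView("people")
result = spark.sql("SELECT name FROM people WHERE age > 35")   # parsed and optimized by Catalyst
result.show()                                                  # executed via the Tungsten engine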