BDA Answers

Big Data refers to large datasets analyzed for patterns and insights, defined by Volume, Velocity, and Variety. It is essential for improved decision-making and operational efficiency, contrasting with traditional BI which focuses on structured data and is less scalable. Key challenges include data management, quality, security, and the need for skilled professionals.


UNIT I

Q1: Define Big Data. Why is Big Data required? How does traditional BI environment
differ from Big Data environment?

Big Data refers to extremely large datasets that may be analyzed computationally to reveal
patterns, trends, and associations, especially relating to human behavior and interactions. Big
Data is defined by the 3Vs: Volume, Velocity, and Variety.

Why is Big Data Required?

 To analyze large volumes of structured, semi-structured, and unstructured data.


 To gain deeper insights for better decision-making.
 To enhance operational efficiency and develop new products.

Traditional BI vs Big Data Environment:

Aspect | Traditional BI | Big Data Environment
Data | Structured | Structured, Semi-structured, Unstructured
Volume | GBs to TBs | TBs to ZBs
Storage | Centralized | Distributed (HDFS)
Processing | Batch | Real-time, Batch
Cost | High | Cost-effective (Open-source)
Scalability | Vertical | Horizontal

Q2: What are the challenges with Big Data?

1. Volume: Managing huge datasets.


2. Velocity: Handling streaming and real-time data.
3. Variety: Integrating multiple formats.
4. Veracity: Ensuring data quality.
5. Security & Privacy: Securing sensitive data.
6. Integration: Combining diverse sources.
7. Scalability: Efficient system scaling.
8. Skill Gap: Lack of trained professionals.
9. Governance: Proper metadata and lineage tracking.
10. Cost: Managing infrastructure cost-effectively.

Q3: Define Big Data. Why is Big Data required? Write a note on data warehouse
environment.

(Definition and requirement same as Q1)


Data Warehouse Environment: A Data Warehouse is a centralized repository for storing
structured data from multiple sources. It supports business intelligence (BI) activities like
querying and reporting. It follows ETL (Extract, Transform, Load) process and typically uses
relational databases. It is optimized for read-heavy operations and is schema-based.

Q4: What are the three characteristics of Big Data? Explain the differences between BI
and Data Science.

Three Characteristics (3Vs):

1. Volume: Refers to vast amounts of data.


2. Velocity: Speed of data generation and processing.
3. Variety: Diversity in data formats and sources.

BI vs Data Science:

Feature | BI | Data Science
Focus | Descriptive | Predictive/Prescriptive
Tools | SQL, OLAP | Python, R, ML libraries
Data | Historical | Real-time, Historical
Goal | Insight | Forecasting, Optimization
User | Analysts | Data Scientists

Q5: Describe the current analytical architecture for data scientists.

Current architecture includes:

 Data Sources: Structured (databases), unstructured (logs, media).


 Data Storage: Data lakes, HDFS, cloud storage.
 Data Processing: Spark, Hadoop, Kafka.
 Analytics Layer: ML/AI models using Python, R, TensorFlow.
 Visualization: Tableau, Power BI, matplotlib.
 Deployment: APIs, dashboards, embedded systems.

Q6: What are the key roles for the New Big Data Ecosystem?

1. Data Engineer: Manages data pipelines.


2. Data Scientist: Builds models.
3. ML Engineer: Deploys AI models.
4. Business Analyst: Translates insights.
5. Data Architect: Designs architecture.
6. DevOps: Handles deployment and monitoring.
Q7: What are key skill sets and behavioral characteristics of a data scientist?

 Skills: Programming (Python, R), Statistics, ML, Data wrangling, Visualization.


 Tools: SQL, Hadoop, Spark, Jupyter.
 Behavioral Traits: Curiosity, critical thinking, communication, business acumen.

Q8: What is Big Data Analytics? Explain in detail with its example. Big Data Analytics
refers to the process of examining large datasets to uncover hidden patterns, correlations,
market trends, and customer preferences. Example: Netflix uses Big Data to analyze user
preferences and recommend content accordingly.

Q9: Write a short note on Classification of Analytics.

1. Descriptive Analytics: What happened? (e.g., dashboards)


2. Diagnostic Analytics: Why did it happen? (e.g., root cause analysis)
3. Predictive Analytics: What will happen? (e.g., sales forecasting)
4. Prescriptive Analytics: What should be done? (e.g., optimization models)

Q10: Describe the Challenges of Big Data. (Already answered in Q2)

Q11: Write a short note on data science and data science process. Data Science:
Interdisciplinary field that uses algorithms, data analysis, and ML to extract insights.

Process:

1. Business understanding
2. Data collection
3. Data preparation
4. Modeling
5. Evaluation
6. Deployment
7. Monitoring

Q12: Write a short note on soft state eventual consistency.

 Soft State: State of the system may change over time without input.
 Eventual Consistency: In distributed systems, data will become consistent over time.
Example: In Amazon DynamoDB, updates propagate eventually to ensure all replicas
converge.

Q13: What are different phases of the Data Analytics Lifecycle? Explain each in detail.

1. Discovery: Understand business domain, define problem.


2. Data Preparation: Clean, transform, and prepare data.
3. Model Planning: Select techniques (e.g., regression, clustering).
4. Model Building: Develop models using tools like R, Python.
5. Communicate Results: Visualize and present to stakeholders.
6. Operationalize: Deploy model into production.
7. Monitoring and Maintenance: Evaluate and update models regularly

UNIT II

Q1: Explain Analytical Theory and Methods Analytical theory and methods refer to
mathematical and statistical techniques used to extract insights from data. They provide the
foundation for algorithms in machine learning and predictive modeling. Examples include
regression, classification, clustering, and time series analysis.

Q2: Write in detail the concept of K-means. K-means is an unsupervised clustering
algorithm that partitions data into K clusters (a code sketch follows the steps below).

 Steps:
1. Select K initial centroids randomly
2. Assign each data point to the nearest centroid
3. Recalculate centroids as the mean of points in each cluster
4. Repeat steps 2 and 3 until convergence
 Applications: Customer segmentation, image compression, pattern recognition
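
A minimal K-means sketch, assuming scikit-learn is available; the toy points and the choice K=2 are purely illustrative, not part of the original answer.

# Minimal K-means sketch with scikit-learn (toy data; K=2 chosen for illustration)
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])      # toy 2-D points

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)                 # assign each point to the nearest centroid
print(labels)                                  # e.g. [1 1 1 0 0 0]
print(kmeans.cluster_centers_)                 # final centroid coordinates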

Q3: Write a short note on Diagnostics Diagnostics are tools and techniques used to
evaluate model performance. They help identify model assumptions, residual errors,
multicollinearity, and overfitting. Examples include confusion matrix, ROC curve, and
residual plots.

Q4: Explain Units of Measure Units of measure indicate the scale or quantity in which data
is represented (e.g., dollars, seconds, kilograms). Ensuring consistency in units is critical for
accurate analysis and comparisons.

Q5: What is meant by Additional Considerations? Additional considerations refer to
supplementary factors influencing model performance or implementation, such as:

 Data quality
 Feature selection
 Handling missing values
 Model interpretability
 Ethical and legal concerns

Q6: What are the Additional Algorithms Apart from basic methods, additional algorithms
include:

 Random Forest
 Gradient Boosting
 Support Vector Machines (SVM)
 Neural Networks
 K-Nearest Neighbors (KNN)

Q7: Write a short note on linear regression model. Also apply Ordinary Least Squares
(OLS) technique to estimate the parameters. Linear regression models the relationship
between a dependent variable and one or more independent variables using a straight line.

 Model: y = β0 + β1x + ε
 OLS estimates β0 and β1 by minimizing the sum of squared errors (SSE):
  β1 = Σ((xi - x̄)(yi - ȳ)) / Σ(xi - x̄)²
  β0 = ȳ - β1x̄
(A short numeric sketch of these formulas is given below.)
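
A minimal NumPy sketch of the OLS formulas above; the x and y values are invented for illustration.

# OLS estimates for simple linear regression, following the formulas above
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # illustrative predictor values
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])   # illustrative responses

beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
print(beta0, beta1)                        # intercept and slope estimates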

Q8: Explain Linear Regression Model with Normally Distributed Errors. In linear
regression, errors (residuals) are assumed to be normally distributed with mean zero and
constant variance. This assumption ensures valid hypothesis testing and confidence intervals.
Q9: What is Logistic regression? Explain in detail. Also explain any two of its
applications. Logistic regression is used for binary classification problems. It models the
probability that a given input belongs to a particular category using the logistic function.

 Applications:
1. Predicting disease presence (yes/no)
2. Customer churn prediction

Q10: Describe logistic regression model with respect to logistic function. The logistic
regression model uses the sigmoid function: P(y=1|x) = 1 / (1 + e^-(β0 + β1x)) It outputs
probabilities between 0 and 1, suitable for classification tasks.
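
A minimal sketch of the sigmoid function and a scikit-learn logistic regression fit; the toy data and the single feature are assumptions made only for illustration.

# Sigmoid function and a minimal LogisticRegression fit on toy data
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))        # P(y=1|x) when z = β0 + β1x

X = np.array([[0.5], [1.5], [2.5], [3.5], [4.5], [5.5]])
y = np.array([0, 0, 0, 1, 1, 1])           # binary labels

model = LogisticRegression().fit(X, y)
print(model.predict_proba([[3.0]]))        # class probabilities for a new input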

Q11: Where decision tree is used? Decision trees are used in:

 Classification and regression tasks


 Risk analysis
 Customer segmentation
 Medical diagnosis

Q12: Explain the General Algorithm of Decision Tree.

1. Select the best attribute using a selection criterion (e.g., Information Gain)
2. Split the dataset into subsets
3. Repeat the process for each subset recursively
4. Stop when a stopping condition is met (e.g., pure leaf)

Q13: Write down the ID3 Algorithm. ID3 (Iterative Dichotomiser 3):

1. Calculate entropy for dataset


2. For each attribute, calculate information gain
3. Choose attribute with highest gain as root
4. Repeat for each branch until the stopping criterion is met (see the entropy sketch below)
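
A minimal sketch of the entropy and information-gain computations used by ID3; the (attribute value, class label) rows are invented for illustration.

# Entropy and information gain for a single attribute, as used by ID3
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

# rows: (attribute value, class label) -- purely illustrative
data = [("sunny", "no"), ("sunny", "no"), ("overcast", "yes"),
        ("rain", "yes"), ("rain", "no"), ("overcast", "yes")]

labels = [label for _, label in data]
base_entropy = entropy(labels)

# weighted entropy after splitting on the attribute
values = {value for value, _ in data}
split_entropy = sum(
    (len(subset) / len(data)) * entropy(subset)
    for subset in ([label for value, label in data if value == v] for v in values)
)
print("information gain:", base_entropy - split_entropy)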

Q14: Explain the Bayes' Theorem. Bayes' Theorem: P(A|B) = [P(B|A) * P(A)] / P(B) It
helps in updating the probability of a hypothesis given new evidence.
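
A tiny worked example of Bayes' theorem; the probabilities (disease prevalence and test accuracy) are assumed numbers used only to show the arithmetic.

# Bayes' theorem with illustrative numbers: P(A|B) = P(B|A) * P(A) / P(B)
p_a = 0.01               # prior: P(disease)
p_b_given_a = 0.95       # P(positive test | disease)
p_b_given_not_a = 0.05   # P(positive test | no disease)

# total probability of a positive test (law of total probability)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))   # ~0.161: posterior P(disease | positive test)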

Q15: What are Diagnostics of Classifiers? Diagnostics assess classifier performance:


 Confusion Matrix
 Precision, Recall, F1 Score
 ROC Curve
 AUC (Area Under Curve)

Q16: Give a brief account Classification Methods used in data analytics.

 Logistic Regression
 Decision Trees
 Naive Bayes
 KNN
 SVM
 Random Forest

Q17: What is sentiment analysis? How it can be carried out? Explain it in detail.
Sentiment analysis identifies and categorizes opinions expressed in text. Steps:

 Text preprocessing
 Feature extraction (TF-IDF, embeddings)
 Classification (positive/negative/neutral)
Tools: NLP libraries, ML models, lexicon-based methods

Q18: Explain the steps involved in categorizing documents by topic.

1. Text preprocessing
2. Feature extraction
3. Apply clustering or topic modeling (e.g., LDA)
4. Interpret and label topics

Q19: Write a note on determining sentiments of documents using text analysis.


Sentiments are identified using NLP techniques by analyzing tone, words, and context.
Techniques include lexicon-based scoring and supervised ML classifiers.

Q20: Write a short note on decision tree. A decision tree is a flowchart-like structure used
for decision-making. It splits data based on features to arrive at decisions or classifications.
Q21: How to predict whether customers will buy a product or not? Explain with respect
to decision tree. Use customer data as input features. Train a decision tree on past data to
classify outcomes (buy/not buy). The tree will identify patterns and rules.

Q22: Explain a probabilistic classification method based on Naive Bayes' theorem.


Naive Bayes assumes independence between features. It calculates posterior probabilities for
each class and assigns the highest. P(Class|Data) ∝ P(Data|Class) * P(Class)

Q23: Describe additional classification methods other than decision tree and Bayes’
theorem.

 K-Nearest Neighbors (KNN)


 Support Vector Machines (SVM)
 Random Forests
 Neural Networks

Q24: How to model a structure of observations taken over time? Explain with respect to
Time series analysis. Also explain any two of its applications. Time series analysis models
sequential data points. Applications:

1. Stock price prediction


2. Sales forecasting

Q25: What are the components of time series? Explain each of them. Also write the
main steps of Box-Jenkins methodology for time series analysis. Components:

 Trend: Long-term progression


 Seasonality: Regular patterns
 Cyclical: Longer-term fluctuations without a fixed period
 Noise: Random variation

Box-Jenkins Steps:

1. Model identification
2. Parameter estimation
3. Model checking

Q26: Explain Autoregressive Integrated Moving Average Model in detail.


ARIMA(p,d,q) combines:
 AR (p): Autoregressive terms
 I (d): Differencing to make series stationary
 MA (q): Moving average of past errors
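
A minimal statsmodels sketch of fitting an ARIMA model; the synthetic random-walk series and the (1, 1, 1) order are assumptions chosen only to illustrate the API.

# Fit an ARIMA(1,1,1) model on a synthetic series with statsmodels
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
series = pd.Series(np.cumsum(rng.normal(size=100)))   # random-walk-like toy data

model = ARIMA(series, order=(1, 1, 1))                # (p, d, q)
fitted = model.fit()
print(fitted.forecast(steps=5))                       # forecast the next 5 points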

Q27: Explain additional time series methods other than Box-Jenkins methodology and
Autoregressive Integrated Moving Average Model.

 Exponential Smoothing
 Seasonal ARIMA (SARIMA)
 Prophet Model
 Long Short-Term Memory (LSTM) networks

Q28: What are major challenges with text analysis? Explain with examples.

 Ambiguity of language
 Sarcasm and irony
 Misspellings and slang
 Multilingual content
Example: "This phone is sick" could be positive or negative depending on context.

Q29: What are various text analysis steps? Explain in detail.

1. Text preprocessing
2. Tokenization
3. Feature extraction
4. Classification or clustering
5. Visualization and interpretation

Q30: Describe ACME's Text Analysis Process.

1. Data acquisition
2. Cleaning and transformation
3. Sentiment scoring
4. Classification
5. Decision-making

Q31: What is the use of Regular Expressions? Explain any five regular expressions with
its description and example. Regular expressions are patterns used for string matching.
Examples:
1. \d+ – Matches one or more digits
2. \w+ – Matches one or more word characters (letters, digits, underscore)
3. ^ – Start of string
4. $ – End of string
5. [a-zA-Z] – Matches alphabetic characters
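
Quick illustrations of these patterns with Python's re module; the sample strings are invented.

# Illustrating the five patterns above with Python's re module
import re

print(re.findall(r"\d+", "Order 66 shipped 2 items"))   # ['66', '2']
print(re.findall(r"\w+", "big-data analytics"))         # ['big', 'data', 'analytics']
print(bool(re.search(r"^Big", "Big Data")))             # True: starts with 'Big'
print(bool(re.search(r"Data$", "Big Data")))            # True: ends with 'Data'
print(re.findall(r"[a-zA-Z]+", "abc123xyz"))            # ['abc', 'xyz']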

Q32: How to normalize the text using tokenization and case folding? Explain in detail.
Also explain about Bag-of-words approach.

 Tokenization: Splitting text into words


 Case Folding: Converting to lowercase
 Bag-of-Words: Represents text by word frequency without order
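
A minimal scikit-learn sketch of tokenization, case folding, and a bag-of-words matrix; the two documents are invented for illustration.

# Tokenization, case folding, and bag-of-words with scikit-learn
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Big Data is BIG", "data drives decisions"]   # illustrative documents

vectorizer = CountVectorizer(lowercase=True)   # lowercase=True performs case folding
bow = vectorizer.fit_transform(docs)           # tokenizes and counts words

print(vectorizer.get_feature_names_out())      # the vocabulary
print(bow.toarray())                           # word-frequency vector per document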

Q33: How to retrieve information and applying text analysis? Explain with respect to
Term Frequency. Term Frequency (TF) counts how often a term appears in a document,
helping identify key terms. High TF indicates relevance within a document.

Q34: What is the critical problem in using Term frequency? How can it be fixed?
Problem: Common terms dominate the scores. Solution: Use TF-IDF which reduces weight
of common words.

Q35: How to categorize documents by topics? Explain in detail. Use topic modeling
methods like LDA:

 Preprocess text
 Extract features
 Fit LDA model
 Interpret topics from word distributions

Q36: Write a note on Box-Jenkins Methodology. Box-Jenkins is a method to identify,
estimate, and check ARIMA models for time series forecasting.

Q37: Explain the ARIMA Model technique ARIMA (Auto-Regressive Integrated Moving
Average) captures autocorrelations in time series data. It combines differencing (I), past
values (AR), and past errors (MA).
Q38: Explain the steps involved in Text analysis with example.

1. Collect data (e.g., customer reviews)


2. Clean data (remove stop words, punctuations)
3. Tokenize and normalize text
4. Extract features (TF-IDF)
5. Apply model (e.g., sentiment analysis)
6. Interpret output

Q39: What is tokenization? Explain how it is used in text analysis. Tokenization splits
text into meaningful units (tokens). It's the first step in NLP, allowing further analysis like
sentiment detection or classification.

Q40: Describe the Term Frequency-Inverse Document Frequency method. TF-IDF = TF * IDF

 TF: Term frequency of a term in a document
 IDF: Inverse document frequency, i.e., the logarithm of (total number of documents /
number of documents containing the term)

TF-IDF highlights important words in a document while reducing the influence of common
terms. (A short code sketch follows.)
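
A minimal scikit-learn TF-IDF sketch; the three documents are invented for illustration.

# TF-IDF weighting with scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["big data analytics", "big data tools", "cooking recipes"]

tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(docs)

print(tfidf.get_feature_names_out())   # terms in the vocabulary
print(matrix.toarray().round(2))       # common terms ('big', 'data') receive lower weight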

UNIT III

Q1: Explain the concept of Data Product. A data product is a product that facilitates end
goals through the use of data. It leverages data analysis, statistical modeling, and machine
learning to provide insights, recommendations, or automation. Examples include
recommendation systems, fraud detection models, and predictive maintenance systems.

Q2: How can Hadoop be used to build Data Products at scale? Hadoop enables scalable
storage and processing of big data. It supports building data products by:

 Storing large volumes of data using HDFS.


 Processing data using MapReduce or Spark.
 Enabling higher-level querying and scripting through tools like Hive and Pig.
 Providing fault-tolerance and high availability in distributed environments.

Q3: Write a short note on: The Data Science Pipeline The Data Science Pipeline refers to
the sequence of steps followed to extract insights from data. Steps include:

 Data collection
 Data cleaning
 Data exploration
 Feature engineering
 Model training
 Model evaluation
 Deployment

Q4: Write a short note on: The Big Data Pipeline The Big Data Pipeline processes large-
scale data through stages such as:

 Data ingestion (e.g., Kafka)


 Storage (e.g., HDFS, NoSQL)
 Processing (e.g., Spark, MapReduce)
 Analytics and visualization (e.g., Tableau)

Q5: Write a short note on: Hadoop – The Big Data Operating System Hadoop is
considered the OS of Big Data due to its ability to manage distributed storage and processing
of massive datasets across clusters of computers using simple programming models.

Q6: Explain Hadoop Architecture Hadoop architecture includes:

 HDFS: For distributed storage


 MapReduce: For data processing
 YARN: Resource management layer
 Hadoop Common: Utilities for Hadoop modules

Q7: Explain the different master and worker services in Hadoop

 Master Services: NameNode (manages metadata), ResourceManager (allocates resources)
 Worker Services: DataNode (stores data), NodeManager (executes tasks)

Q8: Explain the concept of Hadoop Cluster A Hadoop cluster is a collection of machines
(nodes) configured to work together to store and process large datasets in a distributed
environment.

Q9: Explain Hadoop Distributed File System (HDFS) HDFS is a distributed file system
designed to store large data sets reliably across many machines. It has a master-slave
architecture with NameNode and DataNodes.

Q10: Explain the concept of MapReduce. MapReduce is a programming model for processing
large datasets (see the word-count sketch after the list below). It consists of two steps:

 Map: Converts input into key-value pairs.


 Reduce: Aggregates key-value pairs to produce final output.
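
A minimal local sketch of the map and reduce steps for word count, written in Python; it mimics the MapReduce flow in-process and is not tied to any particular Hadoop setup.

# Word count expressed as a map step and a reduce step (local illustration)
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    # Map: emit (word, 1) for every word in the input
    for line in lines:
        for word in line.strip().split():
            yield word, 1

def reducer(pairs):
    # Reduce: sum the counts for each key (pairs must be sorted by key)
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield word, sum(count for _, count in group)

pairs = list(mapper(["big data is big", "data wins"]))
print(list(reducer(pairs)))   # [('big', 2), ('data', 2), ('is', 1), ('wins', 1)]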

Q11: Explain the MapReduce Framework It includes:

 InputFormat: Splits data into chunks


 Mapper: Processes input
 Combiner: Optimizes network usage
 Partitioner: Determines reducer for a key
 Reducer: Produces output
 OutputFormat: Writes output to storage

Q12: Explain the concept of Hadoop Streaming Hadoop Streaming allows developers to
use any programming language to write Map and Reduce functions by reading from standard
input and writing to standard output.

Q13: Advanced MapReduce Concepts:

 a) Combiners: Local reducers that minimize data transfer.


 b) Partitioners: Control distribution of keys to reducers.
 c) Job Chaining: Series of MapReduce jobs executed sequentially.

Q14: Explain the Basics of Apache Spark Apache Spark is a fast, general-purpose engine
for large-scale data processing. It supports batch and stream processing and provides APIs in
Python, Java, Scala, and R.

Q15: Explain the components of Spark Stack

 Spark Core
 Spark SQL
 Spark Streaming
 MLlib
 GraphX

Q16: Explain the concept of Resilient Distributed Datasets (RDDs) RDDs are immutable
distributed collections of objects. They support parallel operations and fault tolerance via
lineage information.
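
A minimal PySpark RDD sketch, assuming pyspark is installed and run in local mode; the numbers are illustrative.

# Minimal RDD example with PySpark (local mode)
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

rdd = sc.parallelize([1, 2, 3, 4, 5])        # immutable distributed collection
squares = rdd.map(lambda x: x * x)           # transformation: lazy, recorded in lineage
print(squares.reduce(lambda a, b: a + b))    # action: triggers computation, prints 55

sc.stop()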

Q17: Write short note on: A typical Spark Application A Spark application includes:

 Driver program
 Cluster manager
 Executors
 Tasks

Q18: Write short note on: The Spark Execution Model Spark executes a DAG (Directed
Acyclic Graph) of stages and tasks. Each task runs on an executor and processes a partition of
the data.

Q19: What is data science pipeline? Explain in detail with a neat diagram. [Diagram not
shown] Steps:

1. Data Collection
2. Data Preparation
3. Modeling
4. Evaluation
5. Deployment
6. Monitoring
Q20: How to refactor the data science pipeline into an iterative model? Explain all its
phases with a neat diagram. Refactoring involves incorporating feedback loops and
iterations across all phases, allowing continuous improvement and adaptation to changes.

Q21: List the requirements of a distributed system in order to perform computation at scale.

 Scalability
 Fault tolerance
 High availability
 Data locality
 Load balancing
 Security

Q22: How Hadoop addresses these requirements?

 Horizontal scalability
 HDFS for fault tolerance
 YARN for resource management
 Data locality through task scheduling
 Security through Kerberos

Q23: Write a short note on Hadoop architecture. [Duplicate of Q6]

Q24: Explain with a neat diagram a small Hadoop cluster with two master nodes and
four worker nodes that implements all six primary Hadoop services. [Diagram not
shown] The services include:

 Master services: NameNode, Secondary NameNode, ResourceManager, JobHistoryServer
 Worker services: DataNode, NodeManager

Q25: Write a short note on Hadoop Distributed File System. [Duplicate of Q9]

Q26: How basic interaction can be done in Hadoop distributed file system? Explain any
five basic file system operations with its appropriate command.

1. hdfs dfs -ls / — List files


2. hdfs dfs -put localfile / — Upload file
3. hdfs dfs -get /file localfile — Download file
4. hdfs dfs -rm /file — Delete file
5. hdfs dfs -mkdir /dir — Create directory

Q27: What are various types of permissions in HDFS? What are different access levels?
Write and explain commands to set various types and access levels. What is a caveat
with file permissions on HDFS? Permissions: Read (r), Write (w), Execute (x). Access Levels: User
(owner), Group, Others. Command: hdfs dfs -chmod 755 /file. Caveat: HDFS trusts the
client-supplied identity, so permissions alone do not provide real security unless strong
authentication (e.g., Kerberos) is enabled; also, the execute permission is ignored for files.

Q28: Explain functionality of map() and reduce() with a neat diagram.


 map(): Processes input and emits key-value pairs.
 reduce(): Aggregates key-value pairs to produce output.

Q29: How MapReduce can be implemented on a Cluster? Explain all phases with a neat
diagram. Phases:

 Input Splitting
 Mapping
 Shuffling & Sorting
 Reducing
 Output writing

Q30: Explain the details of data flow in a MapReduce pipeline executed on a cluster of a
few nodes with a neat diagram. Data flows from HDFS -> Mapper -> Combiner ->
Partitioner -> Reducer -> Output

Q31: Write a short note on Job Chaining. Job chaining allows multiple MapReduce jobs to
be connected, where the output of one job is the input of the next.

Q32: Demonstrate the process of Hadoop streaming in a MapReduce context. Use the
streaming utility:

hadoop jar hadoop-streaming.jar \
  -input input_dir \
  -output output_dir \
  -mapper /bin/cat \
  -reducer /usr/bin/wc

Q33: Demonstrate the process of Computing on CSV Data with Hadoop Streaming. Use
Python/Perl script to parse CSV and apply logic via standard input/output.
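
A minimal sketch of a streaming mapper that parses CSV from standard input; the column positions (category in column 1, amount in column 2) are assumptions made for illustration, and the script would be passed to the streaming jar with -mapper as in Q32.

# csv_mapper.py -- reads CSV rows from stdin and emits tab-separated key-value pairs
import csv
import sys

reader = csv.reader(sys.stdin)
for row in reader:
    if len(row) < 3:
        continue                        # skip malformed rows
    category, amount = row[1], row[2]   # assumed column layout, for illustration only
    print(f"{category}\t{amount}")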

Q34: Demonstrate the process of executing a Streaming job on a Hadoop cluster. Submit
job using Hadoop Streaming JAR, specify mapper and reducer scripts, input, and output.

Q35: Write a short note on Combiners in advanced MapReduce context. Combiners act
as mini-reducers to decrease the volume of data transferred to reducers.

Q36: Write a short note on Partitioners in advanced MapReduce context. Partitioners
determine how intermediate keys are assigned to reducers.

Q37: Write a short note on Job Chaining in advanced MapReduce context. Job chaining
links multiple jobs; output of one job is used as input for the next job.

Q38: Write in brief about Spark. Also write and explain its primary components. Spark
is a distributed computing engine with components:

 Spark Core
 Spark SQL
 Spark Streaming
 MLlib
 GraphX
UNIT IV

Q1: What are the Distributed Analysis and Patterns? Distributed analysis refers to the
processing and analysis of data spread across multiple nodes in a distributed computing
environment. Common patterns include partitioning, replication, and parallel processing.

Q2: What is Computing with Keys? Computing with keys involves using keys to group and
partition data, enabling parallel computations by distributing the data across multiple
reducers.

Q3: Write a short note on Compound Keys. Compound keys are composed of multiple
fields and are used to uniquely identify records and allow composite sorting and grouping in
distributed systems.

Q4: Explain Compound Data Serialization. Compound data serialization involves encoding
complex data structures into a format suitable for transmission or storage. Examples include
Avro, Thrift, and Protocol Buffers.

Q5: What is the Identity Pattern? The identity pattern is a MapReduce pattern where the
map or reduce function simply passes data unchanged. Useful for debugging or data
inspection.

Q6: Write a short note on Pairs versus Stripes. Pairs and stripes are two techniques for
computing co-occurrence matrices. Pairs emit a pair of items per record; stripes emit a single
key with a map of co-occurring items (compare the two outputs in the sketch below).
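
A local Python sketch contrasting the mapper output of the two techniques for a single record; the record and counts are illustrative.

# Pairs vs. stripes mapper output for word co-occurrence (one illustrative record)
from collections import defaultdict

record = ["a", "b", "c"]   # e.g., words appearing together in one document

# Pairs: emit one ((word, neighbor), 1) key-value per co-occurring pair
pairs = [((w, u), 1) for w in record for u in record if w != u]

# Stripes: emit one key per word with a map of its co-occurring neighbors
stripes = {}
for w in record:
    stripe = defaultdict(int)
    for u in record:
        if u != w:
            stripe[u] += 1
    stripes[w] = dict(stripe)

print(pairs)     # [(('a', 'b'), 1), (('a', 'c'), 1), (('b', 'a'), 1), ...]
print(stripes)   # {'a': {'b': 1, 'c': 1}, 'b': {'a': 1, 'c': 1}, 'c': {'a': 1, 'b': 1}}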

Q7: Explain different Design Patterns. Design patterns include Summarization, Filtering,
Data Organization, and Meta Patterns. Each addresses a common problem in distributed
computing.

Q8: What is meant by Summarization? Summarization is aggregating data to compute totals,
averages, or other statistics, often as a pre-step in analysis.

Q9: Explain how keys allow parallel reduction by partitioning the keyspace to multiple
reducers. By assigning a range of keys to different reducers, MapReduce can distribute work
efficiently, allowing for parallel reduction and scaling of large datasets.

Q10: What is the functionality of the explode mapper? Explode mapper transforms a
single input record into multiple output records. Example: A list of tags per post can be
exploded into individual tag-post pairs.

Q11: What is the functionality of the filter mapper? Filter mapper removes records that do
not meet specific criteria. Example: Removing records with null or invalid values.

Q12: What is the functionality of the identity pattern? The identity pattern returns input
data as output without transformation. It’s used for testing and simple data pipeline stages.

Q13: Write briefly about design patterns. Explain each of its category.
 Summarization Patterns
 Filtering Patterns
 Data Organization Patterns
 Meta Patterns
Each category focuses on structuring data or operations to address a common analysis
problem in distributed environments.

Q14: Data flow for predicting number of comments from news/blog data.

 Ingest data → Clean/Tokenize → Feature extraction → Train model → Predict comment count → Visualize

Q15: Hive Query Language Commands:
i) cd $HIVE_HOME
ii) CREATE DATABASE news_data;
iii) CREATE TABLE posts (id INT, title STRING);
iv) LOAD DATA LOCAL INPATH 'posts.csv' INTO TABLE posts;
v) SELECT COUNT(*) FROM posts;
vi) EXIT;

Q16: Data analysis commands in Hive (Examples):

1. SELECT * FROM posts LIMIT 10;


2. SELECT COUNT(*) FROM posts WHERE id > 100;
3. SELECT title FROM posts WHERE title LIKE '%AI%';

Q17: Drawback of conventional relational approach: Relational databases don’t scale well
with unstructured/big data. Resolved via NoSQL, Hadoop, and distributed processing
systems.

Q18: HBase Schema Example:

create 'blog', 'info'
put 'blog', 'row1', 'info:title', 'Big Data'
get 'blog', 'row1'

Q19: Types of HBase filters:

 RowFilter
 ValueFilter
 ColumnPrefixFilter
Commands involve the scan command with FILTER parameters.

Q20: Import MySQL to HDFS:

sqoop import --connect jdbc:mysql://localhost/db --username user --password pass \
  --table table_name --target-dir /hdfs_dir

Q21: Import MySQL to Hive:

sqoop import --connect jdbc:mysql://localhost/db --username user --password pass \
  --table table_name --hive-import --create-hive-table --hive-table hive_table

Q22: Import MySQL to HBase:


sqoop import --connect jdbc:mysql://localhost/db --username user --password pass \
  --table table_name --hbase-table hbase_table --column-family cf

Q23: Flume Data Flow Diagram: Source → Channel → Sink (e.g., HDFS), all running inside a Flume Agent

Q24: Single-agent Flume data flow: Agent reads Apache logs → Stores into HDFS.
Configuration via flume.conf.

Q25: Relations, Tuples, Filtering in Pig:

A = LOAD 'data.txt' AS (id:int, name:chararray);
B = FILTER A BY id > 100;

Q26: Projection in Pig:

C = FOREACH A GENERATE name;

Q27: Grouping and Joining in Pig:

D = GROUP A BY id;
E = JOIN A BY id, B BY id;

Q28: Storing and Outputting in Pig:

STORE A INTO 'output' USING PigStorage(',');

Q29: Pig Relational Operators: LOAD, STORE, FILTER, FOREACH, GROUP, JOIN,
ORDER, DISTINCT, UNION, SPLIT

Q30: Spark SQL Interface Architecture: Consists of:

 DataFrames
 SQL engine
 Catalyst optimizer
 Tungsten execution engine
Diagram: SQL/DataFrame query → Catalyst → Logical/Physical Plan → Tungsten → Execution
(A minimal PySpark example follows.)
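
A minimal PySpark DataFrame/SQL sketch, assuming pyspark is installed and a local session; the table name and rows are invented for illustration.

# Minimal Spark SQL example: DataFrame plus a SQL query (local session)
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("sql-demo").getOrCreate()

df = spark.createDataFrame([(1, "Big Data"), (2, "Spark SQL")], ["id", "title"])
df.createOrReplaceTempView("posts")

# Catalyst turns this query into an optimized physical plan executed by Tungsten
spark.sql("SELECT title FROM posts WHERE id > 1").show()

spark.stop()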
