21CS71 BDA
Questions 11 to 20
11. Describe web content mining and the three phases of web usage mining.
Ans:
Web Content Mining is the process of finding useful information or resources in the content of web pages on the internet. This can be done in two ways:
Direct mining: examining the content of web pages directly.
Search engine mining: using search engines to locate relevant content, which is quicker than the direct method.
Web content mining is related to both data mining and text mining for these reasons:
The information on the web is similar to data from databases or files, so existing data mining techniques
can be used.
Many web pages contain mostly text, making it similar to text mining.
Web data is often semi-structured (partly organized) or unstructured, while data mining focuses on
structured data, and text mining deals with unstructured text.
Web Usage Mining is the process of discovering and analyzing patterns in users' clicks and interactions with websites, together with the data generated by those interactions. It has three phases:
1. Pre-processing:
o Extracts and simplifies the features (like identifying users, sessions, and pages).
2. Pattern Discovery:
o Uses techniques from statistics, data mining, machine learning, and pattern recognition to find
meaningful patterns in the data.
o Methods include clustering, classification, association rules, and sequential pattern analysis.
3. Pattern Analysis:
o Uses tools like query-based analysis, visualization, and OLAP (Online Analytical Processing)
to interpret results.
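To make the three phases concrete, here is a minimal Python sketch of the pattern-discovery step, assuming pre-processing has already turned the raw clickstream into per-user sessions (the session data below is made up for illustration):

from collections import Counter
from itertools import combinations

# Hypothetical output of the pre-processing phase: one list of pages per session.
sessions = [
    ["home", "products", "cart", "checkout"],
    ["home", "blog", "products"],
    ["home", "products", "cart"],
]

# Pattern discovery: count page pairs that co-occur in a session
# (a very simple form of association-rule mining).
pair_counts = Counter()
for pages in sessions:
    for pair in combinations(sorted(set(pages)), 2):
        pair_counts[pair] += 1

# Pattern analysis: inspect the most frequent pairs.
for pair, count in pair_counts.most_common(3):
    print(pair, count)

In a real system the discovered pairs would then be examined in the pattern-analysis phase, for example through visualization or OLAP queries.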
1. Degree
o The number of edges (connections) attached to a vertex. The degree distribution shows how vertex connections are distributed across the network.
2. Closeness
o Calculated as the reciprocal of the sum of distances from the vertex to all others.
3. Effective Closeness
o Uses the approximate average distance instead of exact shortest paths to measure closeness.
4. Betweenness
o Indicates how often a vertex lies on the shortest path between other vertex pairs.
5. PageRank
o A metric for ranking vertex importance based on incoming connections from other important
vertices.
6. Contact Size
o The number of direct contacts (immediate neighbours) a vertex has.
7. Indirect Contacts
o Contacts that can be reached only through intermediaries, such as friends of friends.
8. Structure Diversity
o The variety of distinct groups or communities represented among a vertex's contacts.
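Several of the metrics above can be computed with the Python networkx library; the following is a small illustrative sketch on a made-up graph:

import networkx as nx

# A small, made-up friendship graph.
G = nx.Graph([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")])

degree = dict(G.degree())                    # number of direct connections per vertex
closeness = nx.closeness_centrality(G)       # based on distances to all other vertices
betweenness = nx.betweenness_centrality(G)   # share of shortest paths passing through a vertex
pagerank = nx.pagerank(G)                    # importance based on links from important vertices

for v in G.nodes():
    print(v, degree[v], round(closeness[v], 2),
          round(betweenness[v], 2), round(pagerank[v], 2))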
13. Explain, with a neat diagram, the process in MapReduce when a client submits a job.
Ans:
Process Overview
1. Job Submission:
o The user application specifies the input and output data locations (usually stored in the Hadoop Distributed File System, HDFS) and the map and reduce functions.
o The job client (the user's program) packages the job configuration (which includes the map and reduce functions) and submits the job to the JobTracker.
2. JobTracker (master):
o The JobTracker is the master node in the MapReduce framework. Once it receives the job from the client, it schedules the tasks and distributes the job's configuration to the worker nodes (TaskTrackers) in the cluster.
o The JobTracker keeps track of the job's progress and manages task failures.
3. TaskTrackers (workers):
o The TaskTrackers are the worker nodes in the MapReduce framework, one per cluster node. They receive tasks from the JobTracker and execute the individual map and reduce tasks.
o Each TaskTracker runs one or more map tasks and reduce tasks depending on the size and nature of the job.
4. Map Phase:
o The map tasks read input data from HDFS, process it, and output intermediate key-value pairs.
5. Reduce Phase:
o The reduce tasks take the intermediate key-value pairs generated by the map tasks and perform aggregation or other transformation operations to generate the final output.
6. Job Completion:
o Once the reduce tasks are completed, the JobTracker informs the client of the job completion, and the output is stored in HDFS.
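The data flow can be illustrated with a small, self-contained Python sketch of the classic word-count job; it simulates the map, shuffle, and reduce phases in a single process (in a real cluster these run on TaskTracker nodes and read from and write to HDFS; the input lines are made up):

from collections import defaultdict

# Made-up input records (in Hadoop these would come from HDFS splits).
lines = ["big data analytics", "big data tools", "data mining"]

# Map phase: emit intermediate (key, value) pairs.
intermediate = []
for line in lines:
    for word in line.split():
        intermediate.append((word, 1))

# Shuffle phase: group values by key (done automatically by the framework).
grouped = defaultdict(list)
for key, value in intermediate:
    grouped[key].append(value)

# Reduce phase: aggregate the values for each key to produce the final output.
for word, counts in sorted(grouped.items()):
    print(word, sum(counts))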
14. Explain Hive integration and the workflow steps involved, with a diagram.
Ans:
1. Execute Query
o The Hive interface (CLI or Web UI) sends the query to the Driver.
2. Get Plan
o The Driver passes the query to the Compiler, which checks the syntax and builds an execution plan.
3. Fetch Metadata
o The Compiler requests the metadata of the tables involved from the Metastore.
4. Receive Metadata
o The Metastore sends the metadata back to the Compiler, which completes the plan.
5. Send Plan
o The Compiler returns the execution plan to the Driver.
6. Execute Plan
o The Driver sends the plan to the Execution Engine.
o The Execution Engine interacts with the Metastore for additional metadata operations during execution.
7. Execute Job
o The Execution Engine submits the job to Hadoop, where it runs as a MapReduce job.
8. Send Results
o The Execution Engine collects the results and sends them to the Driver.
9. Fetch Results
o The Driver forwards the results to the Hive Interface for the user.
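From the client's side, steps 1 and 9 look like an ordinary query submission. A sketch using the PyHive package is shown below (this assumes a reachable HiveServer2 instance and a hypothetical sales table; it is an illustration, not part of the workflow diagram itself):

from pyhive import hive  # assumes the PyHive package is installed

# Step 1 (Execute Query): the client submits a query via HiveServer2.
conn = hive.Connection(host="localhost", port=10000, database="default")
cursor = conn.cursor()
cursor.execute("SELECT category, COUNT(*) FROM sales GROUP BY category")

# Steps 2-8 (planning, metadata lookup, job execution) happen inside Hive.
# Step 9 (Fetch Results): the Driver returns the rows to the client.
for row in cursor.fetchall():
    print(row)

cursor.close()
conn.close()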
Ans:
MongoDB Overview
MongoDB is a NoSQL database known for its flexible document-based data model, making it highly
adaptable for various applications. It is designed to store data in JSON-like documents with dynamic
schemas, unlike traditional relational databases that store data in rows and columns.
MongoDB offers features such as high scalability, flexibility, and ease of use, making it ideal for modern,
cloud-based, and distributed applications. It ensures high availability through replication and scales
seamlessly with sharding.
Key features of MongoDB:
1. Document-Oriented Storage:
o Data is stored as JSON-like (BSON) documents rather than rows and columns.
2. Dynamic Schema:
o Documents in the same collection do not have to share an identical structure.
3. Scalability:
o Scales out horizontally across servers through sharding.
4. High Availability:
o Replication (replica sets) provides redundancy and automatic failover.
6. Indexing:
o Full support for various types of indexes for fast query execution.
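A short sketch with the pymongo driver illustrates the document model, dynamic schema, and indexing (it assumes a MongoDB instance on the default local port; the database, collection, and field names are made up):

from pymongo import MongoClient  # assumes the pymongo package is installed

client = MongoClient("mongodb://localhost:27017")
db = client["shop"]

# Document-oriented storage with a dynamic schema: the two documents
# below do not need to share the same fields.
db.customers.insert_one({"name": "Asha", "email": "asha@example.com"})
db.customers.insert_one({"name": "Ravi", "city": "Mysuru", "orders": 3})

# Indexing for fast query execution.
db.customers.create_index("name")

print(db.customers.find_one({"name": "Asha"}))
client.close()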
Ans:
To create a table with partitioning in Hive, the PARTITIONED BY clause is used. This allows you to define the columns by which the table is partitioned.
Syntax (general form):
CREATE TABLE table_name (column_name data_type, ...) PARTITIONED BY (partition_column data_type);
Apache Pig is a high-level platform for creating programs that work with large datasets in a Hadoop
environment. It simplifies the process of writing complex data transformation tasks by providing:
• Abstraction over MapReduce: Allows developers to write high-level scripts instead of coding
MapReduce jobs.
• Parallel Processing Framework: Efficiently processes large-scale data in parallel using the Hadoop
Distributed File System (HDFS).
• Dataflow Language: Uses Pig Latin, a high-level dataflow language, where operations take inputs
and generate outputs for the next step.
1. Simplified Programming
o Developers use Pig Latin, a SQL-like scripting language, to write data transformations.
o Reduces code length dramatically: roughly 10 lines of Pig Latin can replace about 200 lines of MapReduce (Java) code.
2. Built-in Operators
o Offers built-in operators like group, join, filter, sort, and split for common data
transformation tasks.
3. Grunt Shell
o Provides an interactive shell called Grunt for writing and executing Pig Latin scripts.
o Supports various data sources, including HDFS and local file systems.
6. ETL Operations
o Extracts, Transforms, and Loads (ETL) data into HDFS in the required format.
7. Optimized Processing
o Automatically optimizes tasks before execution for faster and efficient processing.
8. Schema Flexibility
9. Parallel Execution
o Reads input files from HDFS, processes them, and writes back the output to HDFS.
o Programmers can focus on operations rather than creating separate mapper and reducer tasks.
12. Philosophy
o Follows the Apache Pig philosophy: "Pigs eat anything, pigs live anywhere, pigs are domestic animals, and pigs fly," mirroring the characteristics of the animal that inspired its name.
Ans: Cassandra:
o Cassandra is a distributed NoSQL database that handles large-scale data across many servers with no single point of failure. Its main data-model components are:
1. Cluster:
o Data is distributed across nodes, offering fault tolerance and horizontal scalability.
2. Keyspace:
o The outermost container for data; it defines the replication strategy for the data it holds.
3. Table:
o Tables support a flexible schema, allowing for wide rows with many columns.
4. Row:
o Contains columns, which store data as name, value, and timestamp (for versioning).
5. Column Family:
o A container for a collection of rows; in CQL, column families are referred to as tables.
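A small sketch with the DataStax Python driver (cassandra-driver) shows how keyspaces, tables, and rows are created; it assumes a Cassandra node running on localhost, and the keyspace and table names are made up:

from cassandra.cluster import Cluster  # assumes the cassandra-driver package

# Connect to a (hypothetical) single-node cluster.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# Keyspace: the outermost container, with its replication strategy.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

# Table with a flexible schema.
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.users (
        user_id int PRIMARY KEY,
        name text,
        city text
    )
""")

# Row: columns stored as name/value pairs (timestamps are kept internally).
session.execute("INSERT INTO demo.users (user_id, name, city) VALUES (1, 'Asha', 'Mysuru')")
print(session.execute("SELECT * FROM demo.users").one())

cluster.shutdown()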
Ans:
Graphs provide a structured way to analyze social networks, enabling visualization and understanding of the relationships between entities. Key elements of social networks as graphs include:
o Vertices (nodes): the entities in the network, such as people, organizations, or pages.
o Edges (links): the relationships or interactions between those entities.
Social network analytics involves evaluating the structural and relational properties of graphs to gain
insights into the network. Key metrics include:
1. Centrality Measures:
o Metrics such as degree, closeness, betweenness, and PageRank that identify the most influential vertices in the network.
2. Anomaly Detection:
o Identifying vertices, edges, or connection patterns that deviate from the normal structure of the network.
3. Structural Characteristics:
o Networks with high structural diversity correlate with better performance outcomes.
o Too many strong ties may negatively impact performance due to redundancy in information
flow.
Ans:
Estimating the relationship between variables involves finding a mathematical expression that describes how
one variable (dependent variable) is related to another variable (independent variable). In practice, this is
often done using regression techniques.
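As a minimal illustration, a straight-line relationship can be estimated by least squares with NumPy (the data points below are made up):

import numpy as np

# Made-up observations of an independent variable x and a dependent variable y.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 7.9, 10.1])

# Fit y = b0 + b1*x by ordinary least squares (degree-1 polynomial fit).
b1, b0 = np.polyfit(x, y, 1)
print(f"estimated relationship: y = {b0:.2f} + {b1:.2f} * x")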
Outliers
Outliers are data points that differ significantly from the rest of the data in a dataset. These points are far
from the mean or expected values and can affect the accuracy of models or predictions.
Identifying Outliers:
• Anomalous Situations: Rare or unusual events that deviate from the normal pattern.
• Intentional Misreporting: In cases where individuals intentionally report incorrect data (e.g., in
surveys with sensitive questions).
Impact of Outliers:
• Outliers can skew results and make it harder to establish accurate relationships between variables.
• Identifying and handling outliers (e.g., removing or adjusting them) improves model accuracy and
reliability.
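A simple way to flag outliers is a z-score rule, sketched below in Python (the data and the threshold of 2 standard deviations are illustrative choices, not a universal rule):

import numpy as np

data = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 95.0])  # 95.0 is an obvious outlier

mean = data.mean()
std = data.std()

# Flag points more than 2 standard deviations from the mean.
z_scores = (data - mean) / std
outliers = data[np.abs(z_scores) > 2]
print("outliers:", outliers)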
Variance
Variance measures the spread or dispersion of data points in a dataset. It is a key statistic for understanding
the variability of data around the mean (expected value). Variance is calculated as the average of the squared
differences from the mean.
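A short worked example of this definition in Python (with made-up values):

import statistics

data = [4, 8, 6, 5, 7]
mean = sum(data) / len(data)                      # 6.0
squared_diffs = [(x - mean) ** 2 for x in data]   # [4.0, 4.0, 0.0, 1.0, 1.0]

# Population variance: average of the squared differences from the mean.
variance = sum(squared_diffs) / len(data)         # 2.0
print(variance, statistics.pvariance(data))       # both print 2.0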