
SURESH B
21CS71 BDA
Important Questions 11 to 20 (Modules 4 and 5)

11. Describe web content mining and the three phases of web usage mining.

Ans:

Web Content Mining is the process of finding information or resources from the content of web pages on
the internet. This can be done in two ways:

Direct mining: Analyzing the content of web pages directly.

Search engine mining: Using search engines to locate content, which is quicker than the direct method.

Web content mining is related to both data mining and text mining for these reasons:

The information on the web is similar to data from databases or files, so existing data mining techniques
can be used.

Many web pages contain mostly text, making it similar to text mining.

Web data is often semi-structured (partly organized) or unstructured, while data mining focuses on
structured data, and text mining deals with unstructured text.

Web Usage Mining is the process of finding and analyzing patterns in users' clicks and interactions with websites. It also includes analyzing the data generated from these interactions.

The process has three main phases, as shown in Figure 9.7 (a small pre-processing sketch follows the list):

1. Pre-processing:

o Cleans and organizes the data collected from different sources.

o Extracts and simplifies the features (like identifying users, sessions, and pages).

o Formats and summarizes the data for further analysis.

2. Pattern Discovery:
o Uses techniques from statistics, data mining, machine learning, and pattern recognition to find
meaningful patterns in the data.

o Methods include clustering, classification, association rules, and sequential pattern analysis.

3. Pattern Analysis:

o Filters out unimportant or irrelevant patterns.

o Uses tools like query-based analysis, visualization, and OLAP (Online Analytical Processing)
to interpret results.
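
To make the first two phases concrete, here is a minimal Python sketch (not from the textbook; the log format with user_id, timestamp, and URL fields is an assumption) that cleans a toy clickstream log and counts page visits:

    # Minimal web usage mining sketch: pre-processing + simple pattern discovery.
    # Assumes a whitespace-separated log with fields: user_id, timestamp, url.
    from collections import defaultdict

    def preprocess(lines):
        """Clean raw log lines and group page visits per user (sessions simplified to users)."""
        sessions = defaultdict(list)
        for line in lines:
            parts = line.strip().split()
            if len(parts) != 3:          # drop malformed records
                continue
            user, ts, url = parts
            sessions[user].append(url)
        return sessions

    def page_counts(sessions):
        """Pattern discovery (simplified): frequency of each page across all sessions."""
        counts = defaultdict(int)
        for urls in sessions.values():
            for url in urls:
                counts[url] += 1
        return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

    log = ["u1 100 /home", "u1 101 /cart", "u2 102 /home", "bad-line"]
    print(page_counts(preprocess(log)))   # e.g. [('/home', 2), ('/cart', 1)]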

12. Discuss the parameters in social graph network topological analysis.


Ans: Parameters in Social Graph Network Topological Analysis (a short computation sketch follows the list)

1. Degree

o Measures the number of edges (connections) linked to a vertex.

o In-degree: Number of edges coming into a vertex.

o Out-degree: Number of edges going out from a vertex.

o Degree Distribution: Shows how vertex connections are distributed across the network.

2. Closeness

o Measures how close a vertex is to all other vertices in the network.

o Calculated as the reciprocal of the sum of distances from the vertex to all others.

o High closeness means faster reachability to other nodes.

3. Effective Closeness

o Uses the approximate average distance instead of exact shortest paths to measure closeness.

o Reduces computation time in large networks.

o Maintains scalability for networks with many edges.

4. Betweenness

o Indicates how often a vertex lies on the shortest path between other vertex pairs.

o Higher betweenness means the vertex acts as a bridge or connector.

o Helps identify bottlenecks or key influencers in the network.

5. PageRank
o A metric for ranking vertex importance based on incoming connections from other important
vertices.

o Assumes connections represent endorsements or relationships.

o Commonly used in web and social network analysis.

6. Contact Size

o Refers to the total number of connections a vertex has.

o Indicates a vertex's connectivity.

o Larger contact size may increase maintenance costs in a big network.

7. Indirect Contacts

o Measures connections within a geodesic distance of 3 steps.

o Highlights indirect influence and extended reach.

o Related to the concept of betweenness centrality.

8. Structure Diversity

o Reflects access to diverse sub-graphs or communities within the network.

o Shows variety in connections and interactions.

o Indicates the ability to access diverse knowledge or influence groups.
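
As a rough illustration of the first two parameters, the following Python sketch (a toy example, not from the textbook; the graph and vertex names are made up) computes degree and closeness on a small undirected graph stored as an adjacency dictionary:

    # Toy computation of degree and closeness on an undirected graph.
    from collections import deque

    graph = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "D"], "D": ["C"]}

    def degree(g, v):
        return len(g[v])                       # number of edges incident to v

    def closeness(g, v):
        """Reciprocal of the sum of shortest-path distances from v to all other vertices."""
        dist = {v: 0}
        queue = deque([v])
        while queue:                           # breadth-first search for shortest paths
            u = queue.popleft()
            for w in g[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total = sum(d for node, d in dist.items() if node != v)
        return 1.0 / total if total else 0.0

    for v in graph:
        print(v, degree(graph, v), round(closeness(graph, v), 3))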

13. With a neat diagram, explain the process in MapReduce when a client submits a job.

Ans:

MapReduce Process: Client Submitting a Job


In a MapReduce environment, submitting a job from a client involves several steps. The steps are explained below and illustrated in the accompanying diagram; a minimal map/reduce sketch follows the steps.

Process Overview

1. Client Submits the Job:

o The user application specifies the input and output data locations (usually stored in the
Hadoop Distributed File System - HDFS) and the map and reduce functions.

o The job client (the user’s program) is responsible for packaging the job configuration (which
includes the map and reduce functions) and submitting the job to the JobTracker.

2. JobTracker Receives the Job:

o The JobTracker is the master node in the MapReduce framework. Once it receives the job
from the client, it schedules the tasks and distributes the job's configuration to the worker
nodes (TaskTrackers) in the cluster.

o The JobTracker keeps track of the job’s progress and manages task failures.

3. TaskTrackers Execute the Tasks:

o The TaskTrackers are the worker nodes in the MapReduce framework, one per cluster node.
They receive the tasks from the JobTracker and execute the individual map and reduce tasks.

o Each TaskTracker runs one or more map tasks and reduce tasks depending on the size and
nature of the job.

4. Map Tasks (Mapper):

o The map tasks read input data from HDFS, process it, and output intermediate key-value
pairs.

o These key-value pairs are passed on to the reduce tasks.

5. Reduce Tasks (Reducer):

o The reduce tasks take the intermediate key-value pairs generated by the map tasks and
perform aggregation or other transformation operations to generate the final output.

6. Job Completion:

o Once the reduce tasks are completed, the JobTracker informs the client of the job completion,
and the output is stored in HDFS.
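
To make the map and reduce steps concrete, the following Python sketch (a local simulation only, not actual Hadoop code; the input lines and function names are assumptions) shows how map tasks emit intermediate key-value pairs that are grouped and then aggregated by reduce tasks:

    # Local simulation of the MapReduce flow for word counting.
    from collections import defaultdict

    def map_task(line):
        """Mapper: emit an intermediate (word, 1) pair for every word in the input line."""
        return [(word, 1) for word in line.split()]

    def reduce_task(word, counts):
        """Reducer: aggregate all intermediate values for one key."""
        return (word, sum(counts))

    input_splits = ["big data analytics", "big data tools"]

    # Shuffle/sort phase: group intermediate pairs by key before reducing.
    grouped = defaultdict(list)
    for split in input_splits:
        for word, one in map_task(split):
            grouped[word].append(one)

    results = [reduce_task(word, counts) for word, counts in grouped.items()]
    print(results)   # e.g. [('big', 2), ('data', 2), ('analytics', 1), ('tools', 1)]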
14. Explain Hive integration and the workflow steps involved, with a diagram.

Ans:

Hive Integration and Workflow Steps

1. Execute Query

o The user sends a query using Hive's CLI or Web Interface.

o The query goes to the Database Driver for execution.

2. Get Query Plan

o The Driver sends the query to the Query Compiler.

o The Compiler checks the syntax and creates a query plan.

3. Fetch Metadata

o The Compiler requests metadata from the Metastore (e.g., MySQL).

4. Receive Metadata

o The Metastore sends the required metadata back to the Compiler.

5. Send Execution Plan

o The Compiler finalizes the plan and sends it to the Driver.

6. Execute Plan

o The Driver forwards the plan to the Execution Engine.

7. Run MapReduce Job

o The Execution Engine initiates a MapReduce job.


o The JobTracker (in the NameNode) assigns tasks to TaskTrackers (in DataNodes).

8. Perform Metadata Operations

o The Execution Engine interacts with the Metastore for additional metadata operations during
execution.

9. Fetch Results

o The Execution Engine collects results from the DataNodes.

10. Send Results to Driver

o The Execution Engine sends the final results to the Driver.

11. Send Results to User

o The Driver forwards the results to the Hive Interface for the user.
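
As a rough sketch of how a client query enters this workflow from a program, the following Python snippet uses the third-party PyHive library; it assumes HiveServer2 is reachable on localhost:10000 and that a hypothetical employees table exists. The SELECT below is compiled, planned, and executed by Hive as described in the steps above:

    # Sketch: submitting a HiveQL query from Python through HiveServer2 (PyHive).
    # Assumes `pip install pyhive[hive]` and a reachable HiveServer2 instance.
    from pyhive import hive

    conn = hive.Connection(host="localhost", port=10000, username="hive")
    cursor = conn.cursor()

    # Hive compiles this query, fetches metadata from the Metastore,
    # and runs the resulting execution plan (e.g. a MapReduce job).
    cursor.execute("SELECT dept, COUNT(*) AS cnt FROM employees GROUP BY dept")

    for row in cursor.fetchall():
        print(row)

    cursor.close()
    conn.close()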

15. Discuss MongoDB in detail.

Ans:

MongoDB Overview

MongoDB is a NoSQL database known for its flexible document-based data model, making it highly
adaptable for various applications. It is designed to store data in JSON-like documents with dynamic
schemas, unlike traditional relational databases that store data in rows and columns.

MongoDB offers features such as high scalability, flexibility, and ease of use, making it ideal for modern,
cloud-based, and distributed applications. It ensures high availability through replication and scales
seamlessly with sharding.

Key Features of MongoDB

1. Document-Oriented Storage:

o Stores data in BSON (Binary JSON) format.

o Supports nested structures, arrays, and flexible schemas.

2. Dynamic Schema:

o No need to define tables or columns upfront.

o Fields can vary between documents in the same collection.

3. Scalability:

o Horizontal scaling through sharding.


o Handles large datasets across multiple servers.

4. High Availability:

o Data replication via replica sets.

o Provides automatic failover.

5. Rich Query Language:

o Supports complex queries, filtering, sorting, and aggregation.

6. Indexing:

o Full support for various types of indexes for fast query execution.
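
A minimal PyMongo sketch illustrating the document model and dynamic schema (assumptions: a local MongoDB server on the default port, and hypothetical database, collection, and field names):

    # Sketch: storing and querying JSON-like documents with PyMongo.
    # Assumes `pip install pymongo` and MongoDB running on localhost:27017.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    db = client["shop"]

    # Dynamic schema: the two documents have different fields and nesting.
    db.orders.insert_one({"order_id": 1, "items": ["pen", "book"], "total": 150})
    db.orders.insert_one({"order_id": 2, "customer": {"name": "Asha", "city": "Mysuru"}})

    # Rich query language: filter and sort without predefining a table structure.
    for doc in db.orders.find({"total": {"$gt": 100}}).sort("order_id", -1):
        print(doc)

    client.close()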

16. Write about HiveQL for the following:

Ans:

(a) Create a Table with Partition

To create a table with partitioning, the PARTITIONED BY clause is used. This allows you to define
the columns by which the table is partitioned.

Syntax:

CREATE [EXTERNAL] TABLE <table_name> (
    <column_name1> <data_type1>,
    <column_name2> <data_type2>,
    ...
)
PARTITIONED BY (
    <partition_column_name> <data_type> [COMMENT '<description>']
);

(b) Add, Rename, and Drop a Partition of a Table
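
The corresponding HiveQL statements are standard Hive DDL; the placeholders below follow the same pattern as in part (a).

Syntax:

-- Add a new partition to an existing partitioned table
ALTER TABLE <table_name> ADD PARTITION (<partition_column> = '<value>');

-- Rename an existing partition
ALTER TABLE <table_name> PARTITION (<partition_column> = '<old_value>')
RENAME TO PARTITION (<partition_column> = '<new_value>');

-- Drop a partition
ALTER TABLE <table_name> DROP PARTITION (<partition_column> = '<value>');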



17. What is Pig in Big Data? Explain the features of Pig.

Ans: What is Pig in Big Data?

Apache Pig is a high-level platform for creating programs that work with large datasets in a Hadoop
environment. It simplifies the process of writing complex data transformation tasks by providing:

• Abstraction over MapReduce: Allows developers to write high-level scripts instead of coding
MapReduce jobs.

• Parallel Processing Framework: Efficiently processes large-scale data in parallel using the Hadoop
Distributed File System (HDFS).

• Dataflow Language: Uses Pig Latin, a high-level dataflow language, where operations take inputs
and generate outputs for the next step.

Features of Apache Pig

1. Simplified Programming

o Developers use Pig Latin, a SQL-like scripting language, to write data transformations.

o Reduces code length significantly: roughly 10 lines of Pig Latin can do the work of about 200 lines of MapReduce code.

2. Built-in Operators

o Offers built-in operators like group, join, filter, sort, and split for common data
transformation tasks.

3. Grunt Shell

o Provides an interactive shell called Grunt for writing and executing Pig Latin scripts.

4. User-Defined Functions (UDFs)


o Custom functions can be written in Java, Python, Ruby, etc., and easily integrated into Pig
scripts.

5. Data Processing Flexibility

o Handles structured, semi-structured, and unstructured data.

o Supports various data sources, including HDFS and local file systems.

6. ETL Operations

o Extracts, Transforms, and Loads (ETL) data into HDFS in the required format.

7. Optimized Processing

o Automatically optimizes tasks before execution for faster and efficient processing.

8. Schema Flexibility

o Handles inconsistent or missing schemas in unstructured data effectively.

9. Parallel Execution

o Executes tasks in parallel using Hadoop’s MapReduce framework, improving performance.

10. Seamless Integration with Hadoop

o Reads input files from HDFS, processes them, and writes back the output to HDFS.

11. Reduced Complexity

o Programmers can focus on operations rather than creating separate mapper and reducer tasks.

12. Philosophy

o Follows the Pig philosophy ("pigs eat anything, pigs live anywhere, pigs are domestic animals, and pigs fly"), reflecting that Pig handles any kind of data, runs in any environment, is easy for users to work with, and processes data quickly; the animal also inspired its name.

18. Discuss the Cassandra database in detail.

Ans:

1. Overview of Cassandra:

o Distributed NoSQL database.

o Originally developed by Facebook, now open-source.

o Handles large-scale data across many servers with no single point of failure.

o Inspired by Google's Bigtable.

o Optimized for high availability and scalability.

o Ideal for applications needing constant uptime and eventual consistency.


2. Architecture:

o No Joins or Foreign Keys (unlike relational databases).

o High write throughput and scalable reads.

o Suitable for distributed environments with large data.

Key Components of Cassandra Data Model

1. Cluster:

o A group of multiple nodes working together.

o Data is distributed across nodes, offering fault tolerance and horizontal scalability.

o Can span multiple machines, physical or virtual.

2. Keyspace:

o Highest-level data structure (equivalent to a database in relational systems).

o Groups related tables together.

o Defines replication strategy and replication factor (number of copies of data).

o Contains multiple tables and allows defining different replication strategies.

3. Table:

o Stores data in rows and columns.

o Each row is uniquely identified by a primary key.

o Tables support flexible schema, allowing for wide rows with many columns.

4. Row:

o The basic unit of data storage.

o Identified by a row key (which determines the location in the cluster).

o Contains columns, which store data as name, value, and timestamp (for versioning).

5. Column Family:

o A container for columns (similar to tables in relational databases).

o Organizes columns by row key.

o Can have arbitrary numbers of columns, which can be added dynamically.
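
A minimal sketch of this data model using the Python cassandra-driver (assumptions: a single local node, a replication factor of 1, and hypothetical keyspace and table names):

    # Sketch: creating a keyspace and table with the DataStax Python driver.
    # Assumes `pip install cassandra-driver` and Cassandra running on 127.0.0.1.
    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])        # cluster: one or more contact points
    session = cluster.connect()

    # Keyspace: defines the replication strategy and replication factor.
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS store
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
    """)

    # Table: rows identified by a primary key; columns can be added later.
    session.execute("""
        CREATE TABLE IF NOT EXISTS store.users (
            user_id int PRIMARY KEY,
            name    text,
            email   text
        )
    """)

    session.execute("INSERT INTO store.users (user_id, name) VALUES (1, 'Ravi')")
    print(session.execute("SELECT * FROM store.users").one())

    cluster.shutdown()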


19. Write a short note on social networks as graphs and social network analytics.

Ans:

Social Network as Graph

A social network can be represented as a graph, where:

• Nodes (Vertices): Represent individuals or organizations.

• Edges (Links): Represent relationships or interactions, such as friendship, financial exchanges, or shared beliefs.

Graphs provide a structured way to analyze social networks, enabling visualization and
understanding of the relationships between entities. Key elements of social networks as graphs
include:

• Clustering: Identifying tightly-knit groups or communities.

• Centrality: Measuring the importance of nodes based on degree, closeness, or betweenness.

• Triangle Counting (Cliques): Identifying small, interconnected sub-networks.

• Connected Components: Finding isolated sub-networks.

Social Network Analytics

Social network analytics involves evaluating the structural and relational properties of graphs to gain
insights into the network. Key metrics include:

1. Centrality Measures:

o Degree Centrality: The number of direct connections a node has.

o Closeness Centrality: How easily a node can reach others.

o Betweenness Centrality: A node's role in connecting other nodes.

o Eigenvector Centrality: Influence of a node based on connections to other influential nodes.

2. Anomaly Detection:

o Detecting abnormal behavior, such as dominant edges or unusual clustering patterns.

o Identifying spam using star-shaped structures or egonet analysis.

3. Structural Characteristics:

o Networks with high structural diversity correlate with better performance outcomes.

o Too many strong ties may negatively impact performance due to redundancy in information
flow.
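
As a rough illustration of two of the graph elements mentioned above, triangle counting and connected components, here is a small Python sketch on a made-up adjacency dictionary (not from the textbook):

    # Toy triangle counting and connected-component detection on an undirected graph.
    graph = {
        "A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B"},   # one triangle
        "D": {"E"}, "E": {"D"},                              # separate component
    }

    def count_triangles(g):
        """Each triangle is counted once per ordered vertex pair, hence divide by 6."""
        total = 0
        for u in g:
            for v in g[u]:
                total += len(g[u] & g[v])   # common neighbours of u and v close a triangle
        return total // 6

    def connected_components(g):
        seen, components = set(), []
        for start in g:
            if start in seen:
                continue
            stack, comp = [start], set()
            while stack:                     # depth-first search from each unseen vertex
                v = stack.pop()
                if v not in comp:
                    comp.add(v)
                    stack.extend(g[v] - comp)
            seen |= comp
            components.append(comp)
        return components

    print(count_triangles(graph))        # 1
    print(connected_components(graph))   # [{'A', 'B', 'C'}, {'D', 'E'}]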

20. Illustrate how to estimate relationships, outliers, and variance.

Ans:

Estimating the Relationship

Estimating the relationship between variables involves finding a mathematical expression that describes how
one variable (dependent variable) is related to another variable (independent variable). In practice, this is
often done using regression techniques.
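
A small illustration of estimating a linear relationship with ordinary least squares (the data values are made up; this is a sketch, not a prescribed method from the textbook):

    # Simple linear regression: estimate y = b0 + b1*x by ordinary least squares.
    x = [1, 2, 3, 4, 5]          # independent variable (made-up data)
    y = [2, 4, 5, 4, 5]          # dependent variable

    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Slope = covariance(x, y) / variance(x); the line passes through the means.
    b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / \
         sum((xi - mean_x) ** 2 for xi in x)
    b0 = mean_y - b1 * mean_x

    print(f"estimated relationship: y = {b0:.2f} + {b1:.2f}x")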

Outliers

Outliers are data points that differ significantly from the rest of the data in a dataset. These points are far
from the mean or expected values and can affect the accuracy of models or predictions.

Identifying Outliers:

• Human Error: Incorrect data entry or collection may cause outliers.

• Anomalous Situations: Rare or unusual events that deviate from the normal pattern.

• Sampling Error: An unrepresentative sample may introduce outliers.

• Intentional Misreporting: In cases where individuals intentionally report incorrect data (e.g., in
surveys with sensitive questions).

Impact of Outliers:

• Outliers can skew results and make it harder to establish accurate relationships between variables.

• Identifying and handling outliers (e.g., removing or adjusting them) improves model accuracy and
reliability.
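
One common way to flag outliers is the z-score rule, sketched below in Python (the data and the threshold of 2 standard deviations are illustrative assumptions, not values from the textbook):

    # Flag outliers as points more than 2 standard deviations from the mean.
    import statistics

    data = [10, 12, 11, 13, 12, 95]          # made-up data; 95 is an obvious outlier

    mean = statistics.mean(data)
    stdev = statistics.stdev(data)           # sample standard deviation

    outliers = [x for x in data if abs(x - mean) / stdev > 2]
    print(outliers)                           # [95]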
Variance

Variance measures the spread or dispersion of data points in a dataset. It is a key statistic for understanding
the variability of data around the mean (expected value). Variance is calculated as the average of the squared
differences from the mean.

Formula for Variance:
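
Completing the statement above, the population variance of N data points x_1, ..., x_N with mean \mu is

    \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2

For a sample, the denominator n - 1 is used instead of N (sample variance).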
