21CS71 BDA
Questions 11 to 20
11. Describe web content mining and the three phases of web usage mining.
Ans:
Web Content Mining is the process of finding useful information or resources in the content of web pages on the internet. This can be done in two ways:
Direct mining: examining the content of web pages directly.
Search engine mining: using search engines to locate relevant content, which is quicker than the direct method.
Web content mining is related to both data mining and text mining for these reasons:
The information on the web is similar to data from databases or files, so existing data mining techniques
can be used.
Many web pages contain mostly text, making it similar to text mining.
Web data is often semi-structured (partly organized) or unstructured, while data mining focuses on
structured data, and text mining deals with unstructured text.
Web Usage Mining is the process of discovering and analyzing patterns in users' clicks and interactions with websites, together with the data generated by those interactions. It has three phases:
1. Pre-processing:
o Extracts and simplifies the features (like identifying users, sessions, and pages).
2. Pattern Discovery:
o Uses techniques from statistics, data mining, machine learning, and pattern recognition to find
meaningful patterns in the data.
o Methods include clustering, classification, association rules, and sequential pattern analysis.
3. Pattern Analysis:
o Uses tools like query-based analysis, visualization, and OLAP (Online Analytical Processing)
to interpret results.
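To make the three phases concrete, here is a minimal Python sketch of the pattern-discovery step, assuming pre-processing has already turned the raw clickstream into per-user sessions (the session data below is made up for illustration):

from collections import Counter
from itertools import combinations

# Hypothetical output of the pre-processing phase: one list of pages per session.
sessions = [
    ["home", "products", "cart", "checkout"],
    ["home", "blog", "products"],
    ["home", "products", "cart"],
]

# Pattern discovery: count page pairs that co-occur in a session
# (a very simple form of association-rule mining).
pair_counts = Counter()
for pages in sessions:
    for pair in combinations(sorted(set(pages)), 2):
        pair_counts[pair] += 1

# Pattern analysis: inspect the most frequent pairs.
for pair, count in pair_counts.most_common(3):
    print(pair, count)

In a real system the discovered pairs would then be examined in the pattern-analysis phase, for example through visualization or OLAP queries.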
1. Degree
o The number of edges (connections) attached to a vertex. The degree distribution shows how vertex connections are distributed across the network.
2. Closeness
o Calculated as the reciprocal of the sum of distances from the vertex to all others.
3. Effective Closeness
o Uses the approximate average distance instead of exact shortest paths to measure closeness.
4. Betweenness
o Indicates how often a vertex lies on the shortest path between other vertex pairs.
5. PageRank
o A metric for ranking vertex importance based on incoming connections from other important
vertices.
6. Contact Size
o The number of direct contacts (immediate neighbours) a vertex has.
7. Indirect Contacts
o Contacts that can be reached only through intermediaries, such as friends of friends.
8. Structure Diversity
o The variety of distinct groups or communities represented among a vertex's contacts.
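Several of the metrics above can be computed with the Python networkx library; the following is a small illustrative sketch on a made-up graph:

import networkx as nx

# A small, made-up friendship graph.
G = nx.Graph([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")])

degree = dict(G.degree())                    # number of direct connections per vertex
closeness = nx.closeness_centrality(G)       # based on distances to all other vertices
betweenness = nx.betweenness_centrality(G)   # share of shortest paths passing through a vertex
pagerank = nx.pagerank(G)                    # importance based on links from important vertices

for v in G.nodes():
    print(v, degree[v], round(closeness[v], 2),
          round(betweenness[v], 2), round(pagerank[v], 2))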
13. Explain, with a neat diagram, the process in MapReduce when a client submits a job.
Ans:
Process Overview
1. Job Submission:
o The user application specifies the input and output data locations (usually stored in the Hadoop Distributed File System, HDFS) and the map and reduce functions.
o The job client (the user's program) packages the job configuration (which includes the map and reduce functions) and submits the job to the JobTracker.
2. JobTracker (master):
o The JobTracker is the master node in the MapReduce framework. Once it receives the job from the client, it schedules the tasks and distributes the job's configuration to the worker nodes (TaskTrackers) in the cluster.
o The JobTracker keeps track of the job's progress and manages task failures.
3. TaskTrackers (workers):
o The TaskTrackers are the worker nodes in the MapReduce framework, one per cluster node. They receive tasks from the JobTracker and execute the individual map and reduce tasks.
o Each TaskTracker runs one or more map tasks and reduce tasks depending on the size and nature of the job.
4. Map Phase:
o The map tasks read input data from HDFS, process it, and output intermediate key-value pairs.
5. Reduce Phase:
o The reduce tasks take the intermediate key-value pairs generated by the map tasks and perform aggregation or other transformation operations to generate the final output.
6. Job Completion:
o Once the reduce tasks are completed, the JobTracker informs the client of the job completion, and the output is stored in HDFS.
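The data flow can be illustrated with a small, self-contained Python sketch of the classic word-count job; it simulates the map, shuffle, and reduce phases in a single process (in a real cluster these run on TaskTracker nodes and read from and write to HDFS; the input lines are made up):

from collections import defaultdict

# Made-up input records (in Hadoop these would come from HDFS splits).
lines = ["big data analytics", "big data tools", "data mining"]

# Map phase: emit intermediate (key, value) pairs.
intermediate = []
for line in lines:
    for word in line.split():
        intermediate.append((word, 1))

# Shuffle phase: group values by key (done automatically by the framework).
grouped = defaultdict(list)
for key, value in intermediate:
    grouped[key].append(value)

# Reduce phase: aggregate the values for each key to produce the final output.
for word, counts in sorted(grouped.items()):
    print(word, sum(counts))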
14. Explain Hive integration and the workflow steps involved, with a diagram.
Ans:
1. Execute Query
o The Hive interface (CLI or Web UI) sends the query to the Driver.
2. Get Plan
o The Driver passes the query to the Compiler, which checks the syntax and builds an execution plan.
3. Fetch Metadata
o The Compiler requests the metadata of the tables involved from the Metastore.
4. Receive Metadata
o The Metastore sends the metadata back to the Compiler, which completes the plan.
5. Send Plan
o The Compiler returns the execution plan to the Driver.
6. Execute Plan
o The Driver sends the plan to the Execution Engine.
o The Execution Engine interacts with the Metastore for additional metadata operations during execution.
7. Execute Job
o The Execution Engine submits the job to Hadoop, where it runs as a MapReduce job.
8. Send Results
o The Execution Engine collects the results and sends them to the Driver.
9. Fetch Results
o The Driver forwards the results to the Hive Interface for the user.
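From the client's side, steps 1 and 9 look like an ordinary query submission. A sketch using the PyHive package is shown below (this assumes a reachable HiveServer2 instance and a hypothetical sales table; it is an illustration, not part of the workflow diagram itself):

from pyhive import hive  # assumes the PyHive package is installed

# Step 1 (Execute Query): the client submits a query via HiveServer2.
conn = hive.Connection(host="localhost", port=10000, database="default")
cursor = conn.cursor()
cursor.execute("SELECT category, COUNT(*) FROM sales GROUP BY category")

# Steps 2-8 (planning, metadata lookup, job execution) happen inside Hive.
# Step 9 (Fetch Results): the Driver returns the rows to the client.
for row in cursor.fetchall():
    print(row)

cursor.close()
conn.close()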
Ans:
MongoDB Overview
MongoDB is a NoSQL database known for its flexible document-based data model, making it highly
adaptable for various applications. It is designed to store data in JSON-like documents with dynamic
schemas, unlike traditional relational databases that store data in rows and columns.
MongoDB offers features such as high scalability, flexibility, and ease of use, making it ideal for modern,
cloud-based, and distributed applications. It ensures high availability through replication and scales
seamlessly with sharding.
Key features of MongoDB:
1. Document-Oriented Storage:
o Data is stored as JSON-like (BSON) documents rather than rows and columns.
2. Dynamic Schema:
o Documents in the same collection do not have to share an identical structure.
3. Scalability:
o Scales out horizontally across servers through sharding.
4. High Availability:
o Replication (replica sets) provides redundancy and automatic failover.
6. Indexing:
o Full support for various types of indexes for fast query execution.
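A short sketch with the pymongo driver illustrates the document model, dynamic schema, and indexing (it assumes a MongoDB instance on the default local port; the database, collection, and field names are made up):

from pymongo import MongoClient  # assumes the pymongo package is installed

client = MongoClient("mongodb://localhost:27017")
db = client["shop"]

# Document-oriented storage with a dynamic schema: the two documents
# below do not need to share the same fields.
db.customers.insert_one({"name": "Asha", "email": "asha@example.com"})
db.customers.insert_one({"name": "Ravi", "city": "Mysuru", "orders": 3})

# Indexing for fast query execution.
db.customers.create_index("name")

print(db.customers.find_one({"name": "Asha"}))
client.close()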
Ans:
To create a table with partitioning in Hive, the PARTITIONED BY clause is used. This allows you to define the columns by which the table is partitioned.
Syntax (general form):
CREATE TABLE table_name (column_name data_type, ...) PARTITIONED BY (partition_column data_type);
Apache Pig is a high-level platform for creating programs that work with large datasets in a Hadoop
environment. It simplifies the process of writing complex data transformation tasks by providing:
• Abstraction over MapReduce: Allows developers to write high-level scripts instead of coding
MapReduce jobs.
• Parallel Processing Framework: Efficiently processes large-scale data in parallel using the Hadoop
Distributed File System (HDFS).
• Dataflow Language: Uses Pig Latin, a high-level dataflow language, where operations take inputs
and generate outputs for the next step.
1. Simplified Programming
o Developers use Pig Latin, a SQL-like scripting language, to write data transformations.
o Reduces code length dramatically: roughly 10 lines of Pig Latin can replace about 200 lines of MapReduce (Java) code.
2. Built-in Operators
o Offers built-in operators like group, join, filter, sort, and split for common data
transformation tasks.
3. Grunt Shell
o Provides an interactive shell called Grunt for writing and executing Pig Latin scripts.
o Supports various data sources, including HDFS and local file systems.
6. ETL Operations
o Extracts, Transforms, and Loads (ETL) data into HDFS in the required format.
7. Optimized Processing
o Automatically optimizes tasks before execution for faster and efficient processing.
8. Schema Flexibility
9. Parallel Execution
o Reads input files from HDFS, processes them, and writes back the output to HDFS.
o Programmers can focus on operations rather than creating separate mapper and reducer tasks.
12. Philosophy
o Follows the Apache Pig philosophy: "Pigs eat anything, pigs live anywhere, pigs are domestic animals, and pigs fly," mirroring the characteristics of the animal that inspired its name.
Ans: Cassandra:
o Cassandra is a distributed NoSQL database that handles large-scale data across many servers with no single point of failure. Its main data-model components are:
1. Cluster:
o Data is distributed across nodes, offering fault tolerance and horizontal scalability.
2. Keyspace:
o The outermost container for data; it defines the replication strategy for the data it holds.
3. Table:
o Tables support a flexible schema, allowing for wide rows with many columns.
4. Row:
o Contains columns, which store data as name, value, and timestamp (for versioning).
5. Column Family:
o A container for a collection of rows; in CQL, column families are referred to as tables.
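A small sketch with the DataStax Python driver (cassandra-driver) shows how keyspaces, tables, and rows are created; it assumes a Cassandra node running on localhost, and the keyspace and table names are made up:

from cassandra.cluster import Cluster  # assumes the cassandra-driver package

# Connect to a (hypothetical) single-node cluster.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# Keyspace: the outermost container, with its replication strategy.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

# Table with a flexible schema.
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.users (
        user_id int PRIMARY KEY,
        name text,
        city text
    )
""")

# Row: columns stored as name/value pairs (timestamps are kept internally).
session.execute("INSERT INTO demo.users (user_id, name, city) VALUES (1, 'Asha', 'Mysuru')")
print(session.execute("SELECT * FROM demo.users").one())

cluster.shutdown()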
Ans:
Graphs provide a structured way to analyze social networks, enabling visualization and understanding of the relationships between entities. Key elements of social networks as graphs include:
o Vertices (nodes): the entities in the network, such as people, organizations, or pages.
o Edges (links): the relationships or interactions between those entities.
Social network analytics involves evaluating the structural and relational properties of graphs to gain
insights into the network. Key metrics include:
1. Centrality Measures:
o Metrics such as degree, closeness, betweenness, and PageRank that identify the most influential vertices in the network.
2. Anomaly Detection:
o Identifying vertices, edges, or connection patterns that deviate from the normal structure of the network.
3. Structural Characteristics:
o Networks with high structural diversity correlate with better performance outcomes.
o Too many strong ties may negatively impact performance due to redundancy in information
flow.
Ans:
Estimating the relationship between variables involves finding a mathematical expression that describes how
one variable (dependent variable) is related to another variable (independent variable). In practice, this is
often done using regression techniques.
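As a minimal illustration, a straight-line relationship can be estimated by least squares with NumPy (the data points below are made up):

import numpy as np

# Made-up observations of an independent variable x and a dependent variable y.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 7.9, 10.1])

# Fit y = b0 + b1*x by ordinary least squares (degree-1 polynomial fit).
b1, b0 = np.polyfit(x, y, 1)
print(f"estimated relationship: y = {b0:.2f} + {b1:.2f} * x")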
Outliers
Outliers are data points that differ significantly from the rest of the data in a dataset. These points are far
from the mean or expected values and can affect the accuracy of models or predictions.
Identifying Outliers:
• Anomalous Situations: Rare or unusual events that deviate from the normal pattern.
• Intentional Misreporting: In cases where individuals intentionally report incorrect data (e.g., in
surveys with sensitive questions).
Impact of Outliers:
• Outliers can skew results and make it harder to establish accurate relationships between variables.
• Identifying and handling outliers (e.g., removing or adjusting them) improves model accuracy and
reliability.
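A simple way to flag outliers is a z-score rule, sketched below in Python (the data and the threshold of 2 standard deviations are illustrative choices, not a universal rule):

import numpy as np

data = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 95.0])  # 95.0 is an obvious outlier

mean = data.mean()
std = data.std()

# Flag points more than 2 standard deviations from the mean.
z_scores = (data - mean) / std
outliers = data[np.abs(z_scores) > 2]
print("outliers:", outliers)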
Variance
Variance measures the spread or dispersion of data points in a dataset. It is a key statistic for understanding
the variability of data around the mean (expected value). Variance is calculated as the average of the squared
differences from the mean.
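A short worked example of this definition in Python (with made-up values):

import statistics

data = [4, 8, 6, 5, 7]
mean = sum(data) / len(data)                      # 6.0
squared_diffs = [(x - mean) ** 2 for x in data]   # [4.0, 4.0, 0.0, 1.0, 1.0]

# Population variance: average of the squared differences from the mean.
variance = sum(squared_diffs) / len(data)         # 2.0
print(variance, statistics.pvariance(data))       # both print 2.0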