21CS71 Imp
Data architecture layers play a crucial role in organizing and managing the flow of data
for analytics. The architecture is typically structured into five logical layers, each serving
specific functions that contribute to the overall analytics process. Here’s a breakdown of
each layer and its functions:
Data quality is crucial in discovering new knowledge and making informed decisions for
several reasons:
1. Accuracy : High-quality data ensures that the information used for analysis is
accurate and reflects the true state of affairs. Inaccurate data can lead to misleading
conclusions, which can adversely affect decision-making processes.
3. Consistency : Maintaining data integrity over its usable life is essential. Consistent
data helps in building reliable models and analyses, which are foundational for sound
decision-making.
2. Explain the classification of Big data. List and explain different data sources.
Big Data can be classified based on various criteria, primarily focusing on data sources
and data formats. Here’s a detailed explanation of the classification of Big Data along with
different data sources:
1. Data Sources : This classification is based on where the data originates. It can be
divided into:
- Traditional Data Sources : These include structured data from relational databases (RDBMS), in-memory data tables, and data warehouses. Examples are:
- Records : Standardized data entries in databases.
- RDBMS : Traditional databases that store structured data.
- Distributed Databases : Databases that are spread across multiple locations.
- Non-Traditional Data Sources : This includes data that is generated from various
modern sources, such as:
- Machine-Generated Data : Data produced by machines, sensors, or automated
systems. For example, data from IoT devices or logs from servers.
- Human-Sourced Data : Data generated by human interactions, such as social media
posts, emails, and biometric data.
- Business Process Data : Data generated from business operations, like transaction
records and customer interactions.
- Business Intelligence Data : Data used for analysis and reporting to support
business decisions.
2. Data Formats : This classification is based on the structure of the data. It can be
categorized into:
- Structured Data : Highly organized data that fits into predefined models, such as
databases and spreadsheets.
- Semi-Structured Data : Data that does not conform to a strict structure but still
contains tags or markers to separate elements, such as XML or JSON files.
- Unstructured Data : Data that lacks a predefined format, such as text documents,
images, videos, and social media content.
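To make these three formats concrete, here is a minimal illustrative sketch in Python (the customer record below is made up), showing the same information as a structured row, a semi-structured JSON document, and unstructured free text:
```python
import json

# Structured: fits a predefined schema (think of a row in an RDBMS table).
structured_row = ("101", "Asha", "Mysuru")          # (customer_id, name, city)

# Semi-structured: self-describing tags, but fields can vary between records.
semi_structured = json.dumps({
    "customer_id": "101",
    "name": "Asha",
    "orders": [{"item": "book", "qty": 2}],         # nested, optional structure
})

# Unstructured: free text (or images/video) with no predefined data model.
unstructured_text = "Asha from Mysuru ordered two books last week and left a review."

print(structured_row)
print(semi_structured)
print(unstructured_text)
```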
2. Data Marts : Subsets of data warehouses that focus on specific business areas or
departments, allowing for more targeted analysis.
3. Data Warehouse : A large storage system that aggregates data from different sources
for analysis and reporting, typically structured for complex queries.
4. NoSQL Databases : These databases, such as MongoDB and Cassandra, are designed
to handle unstructured and semi-structured data, providing flexibility in data storage and
retrieval.
5. Sensor Data : Data collected from various sensors, such as temperature, humidity, and
motion sensors, often used in IoT applications.
7. External Data : This includes data from outside the organization, such as web data,
social media interactions, weather data, and health records, which can enrich internal
datasets.
Data storage and analysis are critical components of Big Data management, especially
given the vast amounts of information generated today. With the evolution of technology,
traditional storage methods have become inadequate for handling the sheer volume,
variety, and velocity of data.
Data Storage : Modern data storage solutions include distributed file systems,
operational data stores (ODS), data marts, and data warehouses. Technologies like
NoSQL databases (such as MongoDB and Cassandra) are increasingly popular due to
their ability to handle unstructured, semi-structured, and multi-structured data. Cloud
computing has also revolutionized data storage, providing scalable and elastic platforms
that can grow with the data needs of organizations. This flexibility allows businesses to
store massive datasets efficiently while ensuring quick access and retrieval.
Data Analysis : The analysis of Big Data involves using advanced tools and techniques
to extract meaningful insights from large datasets. This process typically includes data
ingestion, pre-processing, and transformation, which prepare the data for analysis. Various
software tools, such as MapReduce, Hive, Pig, and Spark, are employed for processing
the data. The goal is to identify patterns, trends, and correlations that can inform decision-
making and enhance business intelligence. Effective data analysis can
lead to improved risk management, contextual marketing, and real-time analytics,
ultimately driving business success.
MODULE 2
1. List Hadoop core components and explain with appropriate diagram.
1. Hadoop Common : This module contains the libraries and utilities required by
other Hadoop modules. It includes components for distributed file systems, general
input/output, serialization, Java RPC (Remote Procedure Call), and file-based data
structures.
3. MapReduce : This is the programming model used in Hadoop for processing large
datasets in parallel and in batches. It consists of two main functions: the Mapper, which
processes input data, and the Reducer, which aggregates the results.
These components work together to provide a robust framework for distributed data
processing, making Hadoop a powerful tool for handling big data.
The Hadoop MapReduce framework is a powerful model for processing large datasets in a
distributed computing environment. It operates through two main functions: distributing
jobs across various nodes in a cluster and organizing the results from these nodes into a
cohesive output.
1. Job Submission : A client node submits a request for a job to the JobTracker, which is
a daemon (background program) in the Hadoop framework. This request includes the
application task or user query that needs to be processed.
2. Resource Estimation : The JobTracker first estimates the resources required for
processing the request. This involves analyzing the current state of the slave nodes (DataNodes) in the cluster to determine their availability and capacity.
3. Task Queueing : After assessing the resources, the JobTracker places the mapping
tasks in a queue. This ensures that tasks are organized and can be executed efficiently.
4. Task Execution : The execution of the job is managed by two main processes:
- Mapper : The Mapper takes the input data and processes it into key/value pairs. It
runs on the nodes where the data is stored, which optimizes data locality and reduces
network congestion.
- Reducer : After the Mapper completes its task, the Reducer takes the output from the
Mapper as input and combines the data tuples into a smaller set of tuples. This step is
crucial for aggregating results.
5. Monitoring and Recovery : The JobTracker continuously monitors the progress of the
tasks. If a task fails, it can restart the task on another available slot, ensuring that the job
completes successfully.
6. Data Serialization : Once the Reducer finishes processing, the output is serialized
using AVRO (a data serialization system) and sent back to the client node.
3. How does the Hadoop MapReduce data flow work for a word count program?
Give an example. (10 Marks)
The Hadoop MapReduce data flow for a word count program follows a structured process
that involves two main phases: the Map phase and the Reduce phase. Let’s break down
how this works, using a word count example.
1. Map Phase:
In the Map phase, the input data (which could be a text file containing a large number of
words) is processed by the Mapper function. The Mapper reads the input data and
processes it line by line. For a word count program, the Mapper performs the following
steps:
- Input Splitting: The input file is split into smaller chunks (input splits), which are
processed in parallel by different Mapper tasks.
- Mapping Function: Each Mapper takes a line of text and breaks it down into
individual words. For each word, it emits a key-value pair where the key is the word itself
and the value is the number 1. For example, if the input line is "hello world hello", the
Mapper would output:
```
(hello, 1)
(world, 1)
(hello, 1)
```
2. Reduce Phase:
Before the Reduce phase begins, the framework shuffles and sorts the intermediate key-value pairs so that all values belonging to the same key are grouped together and delivered to a single Reducer.
- Reducing Function: For the key "hello", the Reducer receives the values [1, 1] and sums them to produce (hello, 2). For "world", it receives [1] and produces (world, 1). The final output from the Reducer would be:
```
(hello, 2)
(world, 1)
```
This structured flow allows Hadoop to efficiently process large datasets in parallel,
making it a powerful tool for big data analytics. The entire process is managed by the
Hadoop framework, which handles job distribution, resource management, and fault
tolerance, ensuring that the word count program runs smoothly across a cluster of
machines.
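The same flow can be sketched in a few lines of Python in the style of Hadoop Streaming, where the Mapper emits (word, 1) pairs and the Reducer sums the values per key. This is an illustrative local simulation, with the framework's shuffle/sort step imitated by sorting the intermediate pairs:
```python
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    """Map phase: emit (word, 1) for every word in every input line."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1

def reducer(pairs):
    """Reduce phase: pairs arrive grouped by key; sum the counts per word."""
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    lines = ["hello world hello"]          # stand-in for an HDFS input split
    intermediate = list(mapper(lines))     # [('hello', 1), ('world', 1), ('hello', 1)]
    # sorted() inside reducer() plays the role of the framework's shuffle/sort.
    print(list(reducer(intermediate)))     # [('hello', 2), ('world', 1)]
```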
2. Channel : A channel acts as a queue that temporarily holds the data received from the
source before it is sent to the sink. The data in the channel remains until it is consumed by
the sink. By default, data is stored in memory, but it can also be configured to be stored
on disk to prevent data loss during network failures.
3. Sink : The sink component is responsible for delivering the data to its final
destination, which could be HDFS, a local file, or another Flume agent. Each sink can
only take data from a single channel, but a Flume agent can have multiple sinks.
The operation of Apache Flume can be visualized as a pipeline where data flows from the
source to the sink through the channel. When data is generated, it is captured by the
source, queued in the channel, and then processed by the sink for storage or further
analysis. This architecture allows for efficient data collection and transfer, ensuring that
large volumes of streaming data can be handled effectively.
MODULE 3
1. Define NoSQL. Explain Big Data NoSQL or 'Not-only SQL' with its features, transactions and solutions.
Features of NoSQL:
1. Schema Flexibility : NoSQL databases allow for schema-less data storage, meaning
data can be inserted without a predefined structure. This is ideal for applications where
data formats may change over time.
2. Horizontal Scalability : NoSQL systems are designed to scale out by adding more
servers (data nodes) to handle increased loads, making them suitable for big data
applications that require scalable storage solutions for terabytes and petabytes of data.
3. Replication and Fault Tolerance : NoSQL databases support data replication across
multiple nodes, ensuring high availability and reliability. If one node fails, others can
continue to serve requests, enhancing fault tolerance.
4. Distributable Architecture : NoSQL solutions allow for sharding, which means data
can be partitioned and distributed across multiple clusters, improving performance and
throughput.
5. Support for Various Data Models : NoSQL encompasses various data storage
models, including key-value stores, document stores (like MongoDB), column-family
stores (like Cassandra), and graph databases, each suited for different types of
applications.
Transactions in NoSQL:
NoSQL databases often sacrifice some of the traditional ACID (Atomicity, Consistency,
Isolation, Durability) properties found in SQL databases in favor of the CAP
(Consistency, Availability, Partition Tolerance) theorem and BASE (Basically Available,
Soft state, Eventually consistent) properties. This means that while NoSQL databases may
not guarantee immediate consistency, they are designed to be highly available and
partition-tolerant, making them suitable for distributed environments.
Graph databases are specialized systems designed to model and manage data that is
interconnected. Here are the key characteristics, typical uses, and examples of graph
databases:
1. Data Structure :
- RDBMS : Uses a structured format with tables, rows, and columns. Each table has a
predefined schema, meaning the structure of the data must be defined before data can be
inserted.
- MongoDB : Utilizes a schema-less design, storing data in collections of documents.
Each document can have a different structure, allowing for greater flexibility in data
storage.
2. Data Storage :
- RDBMS : Data is stored in tables, where each row represents a record and each
column represents a field in that record.
- MongoDB : Data is stored in collections, which are analogous to tables, but each
document (similar to a row) can contain varying fields and data types.
3. Relationships :
- RDBMS : Supports complex relationships through the use of joins between tables.
- MongoDB : Does not use joins; instead, it supports embedded documents, allowing related data to be stored together within a single document.
4. Primary Key :
- RDBMS : Each table has a primary key that uniquely identifies each record.
- MongoDB : Each document has a default primary key (_id) provided by MongoDB itself.
6. Query Language :
- RDBMS : Uses SQL (Structured Query Language) for querying data, which is
powerful but requires a structured approach.
- MongoDB : Uses a document-based query language that allows for dynamic queries,
which can be nearly as powerful as SQL but is designed for the document model.
8. Flexibility :
- RDBMS : Rigid structure due to predefined schemas, making it less adaptable to
changes in data requirements.
- MongoDB : Highly flexible, allowing for easy modifications to the data structure
without downtime.
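To illustrate the document model and the default _id key described above, here is a minimal sketch assuming a MongoDB server running locally and the pymongo driver; the database, collection, and field names are invented:
```python
from pymongo import MongoClient

# Connect to an assumed local MongoDB server.
client = MongoClient("mongodb://localhost:27017")
db = client["school"]

# Schema-less inserts: each document may carry different fields.
db.students.insert_one({"name": "Ravi", "usn": "1XX21CS001", "semester": 7})
db.students.insert_one({
    "name": "Meena",
    "usn": "1XX21CS002",
    # Related data is embedded in the same document instead of a joined table.
    "marks": [{"subject": "Big Data", "score": 88}, {"subject": "AI", "score": 91}],
})

# Every document gets a default _id primary key from MongoDB.
doc = db.students.find_one({"name": "Meena"})
print(doc["_id"], doc.get("marks"))
```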
There are several effective ways to handle Big Data problems, each designed to optimize
performance and manage large datasets efficiently. Here are some key strategies:
3. Moving Queries to Data Nodes : Instead of sending queries to the nodes, many
NoSQL data stores move the queries to the data itself. This approach is more efficient and
is a requirement in Big Data solutions, as it reduces the amount of data transferred over
the network and speeds up query processing.
5. Sharding : Sharding is the practice of dividing a large database into smaller, more
manageable pieces called shards, which are distributed across different servers or clusters.
This not only improves performance but also allows for horizontal scalability as more
machines can be added to handle increased data loads.
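As a toy illustration of the sharding idea (a generic hash-partitioning sketch, not any particular database's implementation), a record's key can be hashed to decide which shard stores it:
```python
import hashlib

NUM_SHARDS = 4  # imaginary cluster of four shard servers

def shard_for(key: str) -> int:
    """Map a record key to one of NUM_SHARDS shards using a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

for key in ["user:101", "user:102", "order:9001"]:
    print(key, "-> shard", shard_for(key))
```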
1. Cluster : A cluster is made up of multiple nodes, which are the individual servers that
store data. Each cluster can contain multiple keyspaces.
3. Column : A column in Cassandra consists of three parts: a column name, a value, and
a timestamp. This structure allows for efficient storage and retrieval of data.
4. Column Family : This is a collection of columns that are grouped together by a row
key. It is similar to a table in relational databases but is more flexible in terms of schema.
Cassandra manages keyspaces by partitioning keys into ranges and assigning these ranges
to specific nodes, which helps in distributing the data evenly across the cluster.
Cassandra supports a variety of data types to accommodate different kinds of data. Here
are some of the key data types:
2. Collection Types : These allow for the storage of multiple values in a single column:
3. User-Defined Types (UDTs) : Cassandra allows users to define their own data types,
which can encapsulate multiple fields of different types, providing a way to model
complex data structures.
5. Blob : A binary large object that can store any type of binary data.
(i) Cassandra Query Language (CQL) commands are essential for interacting with
Cassandra databases. Here are some key commands and their functionalities:
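Commonly used CQL commands include CREATE KEYSPACE, CREATE TABLE, INSERT, SELECT, UPDATE, DELETE and DROP. As an illustration only (the keyspace and table names are made up), they can be issued from the cqlsh shell or, as sketched below, through the DataStax Python driver against a locally running node:
```python
from cassandra.cluster import Cluster  # pip install cassandra-driver

cluster = Cluster(["127.0.0.1"])       # assumes a local Cassandra node
session = cluster.connect()

# CREATE KEYSPACE: the top-level container, with a replication strategy.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS college
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

# CREATE TABLE: rows are partitioned by the primary key (usn).
session.execute("""
    CREATE TABLE IF NOT EXISTS college.students (
        usn text PRIMARY KEY,
        name text,
        semester int
    )
""")

# INSERT and SELECT work much like SQL.
session.execute(
    "INSERT INTO college.students (usn, name, semester) VALUES (%s, %s, %s)",
    ("1XX21CS001", "Ravi", 7),
)
for row in session.execute("SELECT usn, name, semester FROM college.students"):
    print(row.usn, row.name, row.semester)
```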
(ii) NoSQL databases play a crucial role in managing Big Data due to their unique
characteristics and capabilities. They are designed to handle large volumes of data that
traditional SQL databases struggle with. Here are some key points about using NoSQL to
manage Big Data:
1. Scalability : NoSQL databases are built for horizontal scalability, allowing them to
expand by adding more servers to handle increased data loads efficiently. This is
essential for managing terabytes and petabytes of data.
2. Flexibility : NoSQL databases support schema-less data models, meaning data can
be inserted without a predefined structure. This flexibility is vital for Big Data
applications where data formats can vary widely.
3. High Availability : NoSQL solutions often use data replication across multiple nodes
, ensuring that data remains accessible even if some nodes fail. This fault tolerance is
critical for maintaining uptime in Big Data environments.
4. Performance : NoSQL databases are optimized for high-speed data processing and
can handle a large number of read and write operations simultaneously, making them
suitable for real-time analytics.
6. Support for CAP and BASE : Unlike traditional databases that adhere strictly to
ACID properties, NoSQL databases often follow the CAP theorem (Consistency,
Availability, Partition tolerance) and BASE (Basically Available, Soft state, Eventually
consistent) principles, which are more suited for distributed systems handling Big Data.
MODULE 4
Hive provides a variety of built-in functions that can be used for data manipulation and
analysis. Here are some of the key built-in functions along with their return types, syntax,
and descriptions:
1. Count Function
- Return Type: BIGINT
- Syntax: `count(*)`, `count(expr)`
- Description: Returns the total number of retrieved rows.
2. Sum Function
- Return Type: DOUBLE
- Syntax: `sum(col)`, `sum(DISTINCT col)`
- Description: Returns the sum of the elements in the group or the sum of the distinct
values of the column in the group.
3. Average Function
- Return Type: DOUBLE
- Syntax: `avg(col)`, `avg(DISTINCT col)`
- Description: Returns the average of the elements in the group or the average of the
distinct values of the column in the group.
4. Minimum Function
- Return Type: DOUBLE
- Syntax: `min(col)`
- Description: Returns the minimum value of the column in the group.
5. Maximum Function
- Return Type: DOUBLE
- Syntax: `max(col)`
- Description: Returns the maximum value of the column in the group.
These functions are similar in usage to SQL aggregate functions and are essential for
performing data analysis within Hive.
HiveQL has two main components: Data Definition Language (DDL) and Data Manipulation Language (DML).
DDL in HiveQL is used to define and manage the structure of the database and its tables.
Here are some key commands:
- SHOW DATABASES : This command lists all the databases available in Hive.
- CREATE TABLE : This command creates a new table within a database. You can
specify the schema (columns and their data types) and other properties. For example:
```sql
CREATE TABLE toys_tbl (
puzzle_code STRING,
pieces SMALLINT,
cost FLOAT
);
```
- ALTER TABLE : This command modifies an existing table structure, such as adding
or dropping columns.
DML in HiveQL is used for managing the data within the tables. Here are some important
commands:
- LOAD DATA : This command is used to load data into a table from a specified file
path. For example:
```sql
LOAD DATA LOCAL INPATH '<file path>' INTO TABLE <table name>;
```
- SELECT : This command retrieves data from one or more tables. You can use various
clauses like WHERE, GROUP BY, and ORDER BY to filter and organize the results.
For example:
```sql
SELECT * FROM toys_tbl WHERE cost > 10;
```
- DROP TABLE : This command not only removes the table structure but also the data
contained within it.
- INSERT : While Hive does not support traditional row-level updates and deletes, you
can insert new data into a table.
Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. It
simplifies the complexities of writing MapReduce programs by providing a dataflow
language known as Pig Latin. Let's break down its architecture, features, and applications
in detail.
2. Rich Set of Operators : Pig Latin includes a variety of built-in operators such as `group`, `join`, `filter`, `limit`, `order by`, `parallel`, `sort`, and `split`, which facilitate various data manipulation tasks.
3. User Defined Functions (UDFs) : Users can create custom functions that extend Pig's
capabilities, allowing for specialized processing that may not be covered by built-in
functions.
4. Support for Various Data Types : Pig can handle structured, semi-structured, and
unstructured data, making it versatile for different data sources.
5. Multi-Query Approach : Pig allows for the execution of multiple queries in a single
script, significantly reducing the amount of code needed compared to traditional
MapReduce.
6. Interactive Shell (Grunt) : Pig provides an interactive shell called Grunt, where users
can execute Pig Latin commands and scripts in real-time.
1. Analyzing Large Datasets : Pig is widely used for processing and analyzing large
volumes of data, making it suitable for big data applications.
3. Web Log Processing : Pig is effective for processing large data sources such as web
logs, enabling organizations to derive insights from user interactions.
4. Streaming Data Analysis : It can handle streaming online data, making it useful for
real-time analytics and monitoring.
5. Data Processing for Search Platforms : Pig is often employed in search engine data
processing, where it can analyze and index large datasets efficiently.
6. Time-Sensitive Data Loads : Pig can process time-sensitive data quickly, such as
analyzing Twitter data to identify user behavior patterns and generate recommendations.
Apache Pig plays a crucial role in the Hadoop ecosystem, particularly in simplifying the
process of handling large datasets. Here are several key points that highlight its
significance:
1. Abstraction Over MapReduce : Pig serves as an abstraction layer over the complex
MapReduce programming model. This means that it allows users to write data processing
tasks without needing to delve into the intricacies of MapReduce, which can be quite
complex and verbose.
2. High-Level Dataflow Language : Pig uses a high-level dataflow language called Pig
Latin, which is designed to be more intuitive and easier to use than Java-based
MapReduce code. Pig Latin is similar to SQL, making it accessible for users with a basic
understanding of SQL. This reduces the learning curve for data analysts and developers.
3. Simplified Data Processing : With Pig, users can perform complex data
transformations and manipulations with significantly less code. For instance, a Pig script
that might take 10 lines can replace a MapReduce job that requires around 200 lines of
Java code. This efficiency is particularly beneficial for rapid development and
prototyping.
4. Support for Various Data Types : Pig can handle structured, semi-structured, and
unstructured data, making it versatile for different data processing tasks. This capability is
essential in today's data landscape, where data comes in various formats.
5. Interactive and Batch Processing : Pig supports multiple execution modes, including
local mode for testing and MapReduce mode for processing data stored in HDFS. It also
allows for interactive script execution through the Grunt shell, as well as batch processing
by writing scripts in files with a .pig extension.
6. User Defined Functions (UDFs) : Pig allows developers to create UDFs, enabling
them to extend Pig's capabilities by writing custom functions in languages like Java. This
flexibility is vital for addressing specific processing needs that may not be covered by
built-in functions.
10. Community and Support : Being an Apache project, Pig benefits from a robust
community of developers and users. This community support ensures continuous
improvement, updates, and a wealth of resources for users to tap into.
Map Task
1. Input Data Handling : The Map task takes an input dataset, which is typically stored
in the Hadoop Distributed File System (HDFS), and processes it in parallel across various
nodes in a cluster. Each piece of data is treated as a key-value pair.
2. Mapping Function : The `map()` function is applied to each key-value pair (k1, v1).
The key (k1) represents a unique identifier, while the value (v1) is the data associated
with that key. The output of the `map()` function can either be zero (if no relevant data is
found) or a set of intermediate key-value pairs (k2, v2).
3. Parallel Processing : By distributing the input data across multiple nodes, the Map
task allows for parallel processing, which significantly speeds up the data handling
process.
Reduce Task
1. Combining Results : After the Map tasks complete, the output (intermediate key-
value pairs) is sent to the Reduce task. The Reduce task takes these outputs as input and
combines them into a smaller, more manageable set of data.
1. Scalability : The framework can scale horizontally by adding more nodes to the
cluster, allowing it to handle larger datasets without a significant drop in performance.
2. Data Locality : MapReduce optimizes processing by co-locating compute and storage
nodes. This means that tasks are scheduled on nodes where the data is already present,
reducing network traffic and latency.
3. Fault Tolerance : The framework is designed to handle node failures gracefully. If a
TaskTracker fails, the JobTracker can restart the task on another node, ensuring that the
processing continues without significant delays.
4. Efficient Resource Utilization : The execution framework manages the distribution of
tasks, scheduling, and synchronization, which allows for efficient use of cluster resources
and minimizes idle time.
- Ease of Use : In Pig, performing operations like joins, filters, and sorting is relatively
simple, making it accessible for users with basic SQL knowledge. In contrast, MapReduce
requires complex Java implementations, which can be challenging for those not familiar
with Java.
- Code Length : Pig uses a multi-query approach that significantly reduces the length of
code needed to perform tasks. For example, a Pig script might require only 10 lines of
code, whereas the equivalent MapReduce program could require around 200 lines.
- Compilation : Pig does not require a lengthy compilation process; it converts operators
internally into MapReduce jobs. On the other hand, MapReduce jobs involve a long
compilation process before execution.
- Data Types : Pig supports nested data types such as tuples, bags, and maps, which are
not available in MapReduce.
1. How does regression analysis predict the value of the dependent variable in case
of linear regression?(10 Marks)
2. Fitting the Regression Line : The goal of regression analysis is to find the best-fitting line through a scatter plot of data points. This line minimizes the deviation (or error)
between the observed values and the values predicted by the model. The best-fitting line
is known as the regression line.
4. Understanding Errors : The difference between the observed values and the predicted
values is referred to as the error. The regression analysis aims to minimize these errors
across all data points, ensuring that the predictions are as accurate as possible.
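Concretely, simple linear regression predicts the dependent variable as ŷ = b0 + b1·x, where the intercept b0 and slope b1 are chosen to minimize the sum of squared errors. A minimal sketch with made-up data, assuming numpy is available:
```python
import numpy as np

# Made-up sample: advertising spend (x) vs. sales (y).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

# Ordinary least squares estimates:
#   b1 = cov(x, y) / var(x)   and   b0 = mean(y) - b1 * mean(x)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Predicted values and the errors (residuals) the fit tries to minimize.
y_hat = b0 + b1 * x
errors = y - y_hat

print(f"regression line: y_hat = {b0:.3f} + {b1:.3f} * x")
print("prediction for x = 6:", b0 + b1 * 6)
print("residuals:", np.round(errors, 3))
```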
2. Explain with example and algorithm, the working principle of Apriori process
for adopting the subset of frequent item sets as a frequent itemset.(10 Marks)
The Apriori algorithm is a fundamental method used in data mining for frequent itemset
mining and association rule learning. Its working principle is based on the Apriori
principle, which states that if an itemset is frequent, then all of its subsets must also be frequent. The algorithm proceeds in the following steps:
1. Initialization : The algorithm starts by identifying all the individual items in the
database and counting their occurrences. This forms the first set of candidate itemsets,
known as 1-itemsets.
2. Candidate Generation : For each subsequent iteration, the algorithm generates new
candidate itemsets (k+1 itemsets) from the frequent itemsets of the previous iteration (k-
itemsets). This is done by joining the frequent k-itemsets with themselves.
3. Pruning : The algorithm then prunes the candidate itemsets by removing those that do
not have any frequent subsets. This is based on the anti-monotone property of support,
which states that if an itemset is not frequent, then none of its supersets can be frequent.
4. Support Counting : The remaining candidate itemsets are then tested against the
database to count their support (the frequency of occurrence). If the support of an itemset
meets or exceeds a predefined minimum support threshold, it is considered a frequent
itemset.
5. Iteration : Steps 2 to 4 are repeated until no new frequent itemsets can be generated.
Assuming the minimum support threshold is 3, all individual items (A, B, C) are frequent.
Frequent 2-itemsets are {A, B} and {B, C} (both have support ≥ 3).
Since {A, B, C} does not meet the minimum support threshold, the algorithm terminates.
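A minimal sketch of the generate-prune-count loop on a small transaction set (the transactions below are invented for illustration):
```python
from itertools import combinations

# Made-up transactions over items A, B, C, D.
transactions = [
    {"A", "B", "C"}, {"A", "B"}, {"B", "C"},
    {"A", "B", "C"}, {"A", "D"},
]
MIN_SUPPORT = 3  # minimum number of transactions containing the itemset

def support(itemset):
    """Count how many transactions contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def apriori():
    items = {item for t in transactions for item in t}
    frequent = [fs for fs in (frozenset({i}) for i in items) if support(fs) >= MIN_SUPPORT]
    all_frequent, k = list(frequent), 2
    while frequent:
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Pruning: drop candidates having any infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        # Support counting against the database.
        frequent = [c for c in candidates if support(c) >= MIN_SUPPORT]
        all_frequent.extend(frequent)
        k += 1
    return all_frequent

for itemset in apriori():
    print(set(itemset), "support =", support(itemset))
```
With this data the frequent itemsets are {A}, {B}, {C}, {A, B} and {B, C}, and {A, B, C} is eliminated during pruning, mirroring the outcome described above.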
3. Define Web Mining. Discuss the broad classification of web mining and their
applications.(10 Marks)
Web mining refers to the use of techniques and algorithms to extract knowledge from web
data, which is available in the form of web documents and services. It encompasses a
variety of methods aimed at discovering patterns and insights from the vast amount of
information available on the internet. The primary goal of web mining is to transform raw
web data into meaningful knowledge that can be utilized for various applications.
Web mining can be broadly classified into three main categories based on the types of
web data being mined:
1. Web Content Mining : This involves extracting useful information from the content
of web documents. The content can include text, images, audio, video, or structured
records like lists and tables. Applications of web content mining include:
- Classifying web documents into categories.
- Identifying topics of web documents.
- Finding similar web pages across different web servers.
- Enhancing query relevance through recommendations and filters.
2. Web Structure Mining : This focuses on discovering structural information from the
web. It analyzes the relationships between web pages, often represented as a graph where
web pages are nodes and hyperlinks are edges. Applications include:
- Identifying interesting graph patterns.
- Pre-processing the web graph to derive metrics like PageRank.
- Understanding the interconnections between different web resources.
3. Web Usage Mining : This involves analyzing user behavior and patterns based on
web usage data, such as server logs that record user interactions with a website.
Applications of web usage mining include:
- User identification and session creation.
- Detecting malicious activities and filtering content.
- Extracting usage path patterns to improve website design and user experience.
4. Define the term Social network. Explain social network as graphs with
Centralities, Ranking and Anomaly Detection.
A social network is a network of entities (such as people, groups, or organizations) linked by relationships. When we consider a social network as a graph, the entities are represented as nodes and the connections between them as edges. This graphical representation enables the application of various analytical metrics to understand the structure and behavior of the network.
Centralities
Centrality metrics are crucial for analyzing the importance of nodes within a social
network. Key centrality measures include:
1. Degree Centrality : This metric counts the number of direct connections a node has. A
higher degree indicates a more connected node.
2. Closeness Centrality : This measures how close a node is to all other nodes in the
network, reflecting its ability to access information quickly.
3. Betweenness Centrality : This indicates how often a node acts as a bridge along the
shortest path between two other nodes, highlighting its role in facilitating communication
within the network.
4. Eigenvector Centrality : This metric considers not just the number of connections a
node has, but also the quality and influence of those connections.
Ranking
Ranking within social networks often involves determining the significance of nodes
based on their centrality scores. For instance, nodes with high betweenness centrality may
be ranked higher due to their critical role in connecting disparate parts of the network.
PageRank, a well-known algorithm originally used by Google, is another method for
ranking nodes based on their connections and the importance of those connections.
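These measures can be computed directly with a graph library. A minimal sketch on a tiny made-up friendship graph, assuming the networkx package is available:
```python
import networkx as nx

# A small made-up social graph: nodes are people, edges are "knows" relations.
G = nx.Graph([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")])

print("degree centrality:     ", nx.degree_centrality(G))
print("closeness centrality:  ", nx.closeness_centrality(G))
print("betweenness centrality:", nx.betweenness_centrality(G))   # C and D act as bridges
print("eigenvector centrality:", nx.eigenvector_centrality(G, max_iter=1000))
print("PageRank:              ", nx.pagerank(G))
```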
Anomaly Detection
Anomaly detection in social networks involves identifying unusual patterns or behaviors
that deviate from the norm. This can be achieved through the analysis of centrality metrics
and the structure of the network. For example, a node exhibiting an unusually high degree
of connections compared to others may indicate a spam account or a potential security
threat. Techniques such as ego-networks, which focus on a specific node and its
immediate connections, can help in identifying these anomalies.
5. What are outliers? Describe the reasons for the presence of outliers in a
relationship.(10 Marks)
Outliers are data points that significantly differ from the rest of the dataset. They can
appear as if they do not belong to the dataset and are often identified as points that are
numerically distant from other observations. The presence of outliers is important to
consider because they can affect the quality of data analysis, potentially skewing results
and leading to incorrect conclusions.
1. Anomalous Situations : Outliers may arise from unusual or rare events that do not
reflect the typical behavior of the data.
5. Sampling Error : Outliers can also occur when an unfitted or biased sample is
collected from the population, resulting in data points that do not accurately represent the
overall dataset.
Understanding and identifying outliers is crucial for improving data quality and ensuring
accurate analysis, as they can significantly influence statistical outcomes and predictions.
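As a simple illustration of identifying such points, one common rule of thumb flags values lying more than 1.5 times the interquartile range (IQR) outside the middle 50% of the data; a minimal sketch with made-up values, assuming numpy:
```python
import numpy as np

# Made-up observations with one value that clearly does not belong.
data = np.array([12, 13, 12, 14, 15, 13, 14, 95, 13, 12])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # the usual 1.5*IQR fences

outliers = data[(data < lower) | (data > upper)]
print("fences:", (round(lower, 2), round(upper, 2)))
print("outliers:", outliers)   # expected: [95]
```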