21CS71 Solutions
7th Semester B.E. Degree Examination
BIG DATA ANALYTICS
TIME: 03 Hours Max. Marks: 100
Note: 01. Answer any FIVE full questions, choosing at least ONE question from each MODULE.
The evolution of Big Data can be understood through its key characteristics,
often summarized as the "4Vs":
1. Volume : This refers to the enormous scale of data generated from sources such as business transactions, social media, and sensors; storing and processing data at this scale is beyond the capacity of traditional systems.
2. Velocity : This refers to the speed at which data is generated and processed.
In today's fast-paced digital landscape, the ability to quickly analyze and act on
data is crucial for businesses to maintain a competitive edge.
3. Variety : Big Data encompasses a wide range of data types and formats,
including structured, semi-structured, and unstructured data. This diversity arises
from multiple sources, such as social media, sensors, and transaction records,
adding complexity to data management and analysis.
4. Veracity : This characteristic addresses the quality and accuracy of the data.
With the vast amounts of data being generated, ensuring that the data is reliable
and valid is essential for accurate analysis and informed decision-making.
Big Data has a wide range of applications across various industries; two significant ones are Marketing and Sales, and Fraud Detection.
1. Marketing and Sales : Big Data plays a crucial role in enhancing marketing
strategies and sales processes. Companies leverage Big Data analytics to gain
insights into customer behavior, preferences, and trends. For instance, businesses
can analyze data to determine the most effective content at each stage of the sales
cycle, which helps in tailoring marketing campaigns to specific customer needs.
Additionally, Big Data enables companies to invest in improving their Customer
Relationship Management (CRM) systems, which can lead to increased Customer
Lifetime Value (CLTV) and reduced Customer Acquisition Cost (CAC). By
utilizing contextual marketing, businesses can send targeted advertisements based
on users' recent browsing patterns, thereby increasing the chances of conversion.
2. Fraud Detection : In the realm of financial services and e-commerce, Big Data
analytics is instrumental in detecting and preventing fraud. By integrating multiple
data sources and analyzing them, companies can gain greater insights into
transaction patterns and identify anomalies that may indicate fraudulent activity.
For example, advanced analytics can help in generating structured reports and
visualizations that highlight unusual behaviors. Moreover, the high volume of data
allows for faster detection of threats and the ability to predict potential frauds by
utilizing publicly available information. This proactive approach not only helps in
safeguarding assets but also enhances overall business intelligence.
c How does the Berkeley Data Analytics Stack help in analytics tasks? CO1 05
1. Data Ingestion : The stack supports efficient data ingestion from multiple
sources, whether they are internal or external. This is crucial for gathering
diverse datasets that can provide richer insights.
3. Storage Solutions : The stack utilizes distributed data storage systems that can
manage high volumes of data. This ensures that data is stored efficiently and can
be accessed quickly for analysis.
6. Scalability : One of the standout features of the Berkeley data analytics stack
is its scalability. It can scale up (adding more resources to existing systems) and
scale out (adding more systems) to handle increasing workloads, which is vital
for organizations dealing with growing datasets.
7. Integration with IoT : The stack can also integrate with Internet of Things
(IoT) devices, enabling real-time data collection and analysis, which is
particularly beneficial in sectors like healthcare.
Module-2
1. Hadoop Common : This module contains the libraries and utilities required by
other Hadoop modules, including components for the distributed file system and
general input/output operations.
In addition to these core components, the Hadoop ecosystem includes various tools
that enhance its functionality:
- Apache Pig : A high-level platform for creating programs that run on Hadoop,
using a language called Pig Latin for data transformation.
- Apache Hive : A data warehouse infrastructure that provides data
summarization and ad-hoc querying using a SQL-like language called HiveQL.
- Apache Oozie : A workflow scheduler system that manages Hadoop jobs and
allows users to define complex workflows.
- Apache HBase : A distributed, scalable, NoSQL database that runs on top of
HDFS and provides real-time read/write access to large datasets.
b Explain with neat diagram HDFS Components. CO2 08
1. NameNode : This is the master server that manages the metadata of the file
system. It keeps track of the file system namespace, which includes operations like
opening, closing, and renaming files and directories. The NameNode also
determines how data blocks are mapped to DataNodes and handles any DataNode
failures. Importantly, the NameNode does not store any actual data; it only
maintains the metadata.
2. DataNodes : These are the slave nodes in the HDFS architecture. DataNodes
are responsible for storing the actual data blocks and serving read and write
requests from clients. Each DataNode manages the storage of data blocks and
periodically sends heartbeat signals to the NameNode to confirm its status.
4. HDFS Blocks : HDFS divides files into large blocks, typically 64MB or
128MB in size. This design is optimized for high-throughput access to large
datasets, making it efficient for streaming data rather than random access.
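As a small worked illustration of block splitting (the 128 MB block size and the 1 GB file size below are assumptions used only for the arithmetic), the number of blocks a file occupies can be estimated as follows:

```python
# Minimal sketch: how many HDFS blocks a file occupies for a given block size.
# The 128 MB block size and the 1 GB file are illustrative assumptions.
import math

BLOCK_SIZE = 128 * 1024 * 1024          # 128 MB, a common HDFS block size

def block_count(file_size_bytes: int) -> int:
    """Number of HDFS blocks needed to store a file of the given size."""
    return math.ceil(file_size_bytes / BLOCK_SIZE)

# A 1 GB file -> 8 blocks of 128 MB each; the last block of a file may be smaller.
print(block_count(1 * 1024 ** 3))        # 8
```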
1. ETL Tools : Hive provides tools that simplify the extraction, transformation,
and loading (ETL) of data.
2. Data Structuring : It imposes structure on various data formats, making it
easier to manage and query.
3. Data Access : Users can access files stored directly in HDFS or in other data
storage systems like HBase.
4. Query Execution : Hive executes queries using MapReduce or Tez, a DAG-based execution engine that typically runs faster than classic MapReduce.
To use Hive, a user with access to HDFS can run Hive queries by simply
entering the `hive` command in the command line. If Hive starts correctly, the
user will see a `hive>` prompt, indicating that they can begin executing queries.
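The `hive` command line shown above is one entry point. As an alternative sketch only, the same kind of HiveQL query could be submitted programmatically, assuming the third-party PyHive package and a HiveServer2 service on localhost:10000 (both assumptions; the `sales` table is invented for illustration):

```python
# Hedged sketch: running a HiveQL query through HiveServer2 using PyHive.
# Assumes `pip install pyhive[hive]` and a HiveServer2 service on localhost:10000.
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, database="default")
cursor = conn.cursor()

# HiveQL looks like SQL; Hive compiles it into MapReduce/Tez jobs behind the scenes.
cursor.execute("SELECT product, SUM(amount) FROM sales GROUP BY product")
for product, total in cursor.fetchall():
    print(product, total)

conn.close()
```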
Apache Sqoop is a powerful tool designed for transferring data between Hadoop
and relational databases. It facilitates both the import of data from relational
database management systems (RDBMS) into the Hadoop Distributed File System
(HDFS) and the export of data from HDFS back into RDBMS. Let’s break down
the import and export methods in detail:
The export process is quite similar to the import method and also consists of two
steps:
1. Metadata Examination : Just like in the import process, Sqoop examines the
database for metadata before exporting data. This ensures that the data being
exported matches the structure of the target database.
2. Data Writing : Sqoop then executes a Map-only Hadoop job to write the data
back to the RDBMS. During this process, Sqoop divides the input data set into
splits and uses individual map tasks to push these splits to the database. This
parallel processing allows for efficient and fast data transfer.
Key Features:
- Parallel Processing : Sqoop exploits the MapReduce framework to perform
both import and export operations, allowing for parallel processing of sub-tasks,
which significantly speeds up the data transfer.
- Fault Tolerance : Sqoop provisions for fault tolerance, ensuring that data
transfer can recover from failures without losing data.
- Command Line Interface : Users can interact with Sqoop through a command
line interface, and it can also be accessed using Java APIs, providing flexibility in
how it is used.
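As an illustrative sketch only, a Sqoop export of an HDFS directory into a relational table could be launched from Python by shelling out to the standard `sqoop export` command; the JDBC URL, credentials, table, and directory below are placeholders, and Sqoop is assumed to be installed and on PATH:

```python
# Hedged sketch: invoking a Sqoop export from Python via subprocess.
# The connection string, table name, and paths are placeholders, not real systems.
import subprocess

cmd = [
    "sqoop", "export",
    "--connect", "jdbc:mysql://dbhost/salesdb",   # target RDBMS (placeholder)
    "--username", "dbuser",
    "--table", "daily_sales",                     # table must already exist with a matching schema
    "--export-dir", "/user/hadoop/daily_sales",   # HDFS directory holding the data to push
    "-m", "4",                                    # four parallel map tasks push the splits
]
subprocess.run(cmd, check=True)
```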
b Explain Apache Oozie with neat diagram. CO2 07
1. Workflow Jobs : These are represented as Directed Acyclic Graphs (DAGs) of actions, which Oozie executes in the order specified by the workflow definition.
2. Coordinator Jobs : These are scheduled jobs that run at specified time intervals or are triggered by the availability of data. This feature is particularly useful for recurring tasks.
YARN, which stands for Yet Another Resource Negotiator, is a key component
of the Hadoop ecosystem that serves as a resource management platform. It plays
a crucial role in managing and scheduling resources for various applications
running on a Hadoop cluster. Here’s a detailed breakdown of the YARN
application framework:
2. Core Components :
- Resource Manager (RM) : The RM is the master daemon that manages the
allocation of resources across all applications in the system. It keeps track of the
available resources and the status of all Node Managers (NMs) in the cluster.
- Node Manager (NM) : Each cluster node runs an NM, which is responsible
for managing the resources on that node. It monitors resource usage (CPU,
memory) and reports this information back to the RM.
- Application Master (AM) : For each application submitted to YARN, an
AM is instantiated. The AM is responsible for negotiating resources from the
RM and working with the NMs to execute and monitor the application’s tasks.
- Containers : These are the basic units of resource allocation in YARN. A
container encapsulates the resources (CPU, memory) required to run a specific
task of an application.
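As a rough worked example (the memory figures below are assumptions, and real schedulers also account for CPU vcores), the number of containers a single NodeManager can host can be estimated from its configured memory:

```python
# Back-of-the-envelope sketch: containers per node from YARN memory settings.
# Property names follow standard YARN configuration; the values are illustrative.
node_memory_mb = 64 * 1024        # yarn.nodemanager.resource.memory-mb (assumed 64 GB)
container_memory_mb = 4 * 1024    # per-container allocation requested by the Application Master

containers_per_node = node_memory_mb // container_memory_mb
print(containers_per_node)        # 16 containers of 4 GB each on this node
```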
NoSQL, which stands for "Not Only SQL," is a category of non-relational data
storage systems designed to handle large volumes of data with flexible data
models. Unlike traditional SQL databases, NoSQL databases do not require a fixed
schema, allowing for dynamic and schema-less data storage. This makes them
ideal for managing big data, accommodating various data types like key-value
pairs, document stores (e.g., MongoDB), column-family stores (e.g., Cassandra),
and graph databases.
1. Consistency (C) : This means that every read operation receives the most
recent write or an error. In other words, all nodes in the system see the same data
at the same time. If one node updates data, all other nodes must reflect that
change immediately.
2. Availability (A) : This property ensures that every request (whether read or write) receives a response, even if that response does not contain the most recent data. The system therefore remains operational and responsive even if some nodes are down or unreachable.
3. Partition Tolerance (P) : The system continues to operate despite network partitions, that is, even when messages between nodes are lost or delayed.
In essence, the CAP theorem implies that if a network partition occurs, a system
must choose between consistency and availability. For example, if a system
prioritizes consistency, it may become unavailable during a partition.
Conversely, if it prioritizes availability, it may return stale or outdated data. This
trade-off is crucial for designing distributed systems, especially in the context of
big data applications.
b Explain NOSQL Data Architecture Patterns. CO3 10
NoSQL Data Architecture Patterns refer to the various ways in which NoSQL
databases are structured to handle data storage and retrieval efficiently. Here are
some key patterns:
5. Object Stores : This pattern allows for the storage of data as objects, which can
include both the data itself and metadata. Object stores are often used for
unstructured data and are designed to handle large amounts of data efficiently.
3. Self-Healing : If a link between nodes fails, the architecture can create new
links to maintain connectivity, ensuring that the system remains operational even
in the face of hardware failures.
5. Powerful Querying : MongoDB supports a rich query language that allows for
deep querying capabilities, including dynamic queries on documents. This is
nearly as powerful as SQL, enabling complex data retrieval.
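A short sketch of that document-style querying, assuming a local MongoDB instance on the default port and the PyMongo driver (the `students` collection and its fields are invented for illustration):

```python
# Hedged sketch: dynamic queries on schema-less documents with PyMongo.
# Assumes MongoDB on localhost:27017 and `pip install pymongo`; the data is illustrative.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["college"]

# Documents are JSON-like records with no fixed schema.
db.students.insert_one({"usn": "1XX21CS001", "name": "Asha", "marks": {"bda": 78}})

# Query documents whose nested field exceeds a threshold, similar in spirit to a SQL WHERE clause.
for doc in db.students.find({"marks.bda": {"$gt": 70}}, {"_id": 0, "name": 1, "marks.bda": 1}):
    print(doc)

client.close()
```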
Module-4
Q. 07 a Explain Map Reduce Execution steps with neat diagram. CO4 10
2. Mapping : In the Map phase, the input data is processed by the Mapper. Each Mapper takes a key-value pair as input and applies the `map()`
function to generate intermediate key-value pairs. The Mapper operates
independently on each piece of data, allowing for parallel processing across
multiple nodes.
4. Shuffling and Sorting : Once the mapping is complete, the Shuffle phase
begins. This involves redistributing the data based on the intermediate keys
generated by the Mappers. The system groups all the intermediate key-value pairs
by their keys, which prepares them for the Reduce phase. During this phase, the
data is also sorted, ensuring that all values associated with the same key are
brought together.
5. Reducing : The Reduce phase takes the grouped data and processes it using
the `reduce()` function. Each Reducer receives a key and a list of values associated
with that key. The Reducer combines these values to produce a smaller set of
output data, which is the final result of the MapReduce job.
6. Output Storage : Finally, the output from the Reduce tasks is written back to
HDFS. This output can then be used for further analysis or processing.
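To make these steps concrete, a classic word-count map and reduce pair can be sketched in the Hadoop Streaming style (Python is assumed to be available on the cluster nodes); here the shuffle-and-sort step is simulated locally so the sketch runs on its own:

```python
# Hedged sketch: word count written as map and reduce functions, with a local
# simulation of the shuffle-and-sort that the MapReduce framework performs.
from collections import defaultdict

def map_fn(line):
    """Map phase: one input line -> intermediate (word, 1) pairs."""
    for word in line.strip().lower().split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce phase: all counts for one word -> a single (word, total) pair."""
    return word, sum(counts)

lines = ["big data analytics", "big data tools", "data"]   # illustrative input

# Shuffle and sort: group intermediate values by key (done by the framework in real Hadoop).
grouped = defaultdict(list)
for line in lines:
    for word, one in map_fn(line):
        grouped[word].append(one)

for word in sorted(grouped):
    print(reduce_fn(word, grouped[word]))   # ('analytics', 1), ('big', 2), ('data', 3), ...
```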
Hive is a data warehousing tool that was created by Facebook and is built on top
of Hadoop. It allows users to manage and analyze large datasets stored in
Hadoop's HDFS (Hadoop Distributed File System) using a SQL-like query
language called HiveQL (or HQL). Hive is particularly suited for processing
structured data and can integrate data from various heterogeneous sources,
making it a powerful tool for enterprises that need to track, manage, and analyze
large volumes of data.
Hive Architecture
The architecture of Hive consists of several key components that work together
to facilitate data processing and querying:
1. Hive Server (Thrift) : This is an optional service that allows remote clients to
submit requests to Hive and retrieve results. It exposes a simple client API for
executing HiveQL statements, enabling interaction with Hive using various
programming languages.
3. Web Interface : Hive can also be accessed through a web browser, provided
that a Hive Web Interface (HWI) server is running. Users can access Hive by
navigating to a URL of the form `http://hadoop:<port no>/hwi`.
4. Metastore : This is the system catalog of Hive, where all other components
interact. The Metastore stores metadata about tables, databases, and columns,
including their data types and HDFS mappings. It is crucial for managing the
schema of the data being processed.
1. Execute Query : The Hive interface (CLI or Web Interface) sends a query to
the Database Driver.
2. Get Plan : The Driver sends the query to the query compiler, which checks
the syntax and prepares a query plan.
3. Get Metadata : The compiler requests metadata from the Metastore.
4. Send Metadata : The Metastore responds with the necessary metadata.
OR
Q. 08 a Explain Pig architecture for scripts dataflow and processing CO4 10
Apache Pig architecture is designed to facilitate the execution of Pig Latin scripts
in a Hadoop environment, specifically within the Hadoop Distributed File
System (HDFS). Here’s a detailed breakdown of how the architecture works for
scripts dataflow and processing:
1. Pig Latin Scripts Submission : The process begins when a Pig Latin script is
submitted to the Apache Pig Execution Engine. This engine is responsible for
interpreting and executing the commands written in Pig Latin.
2. Parser : Once the script is submitted, it goes through a parser. The parser
performs several critical functions:
- Type Checking : It checks the types of the data being processed to ensure
they are compatible with the operations specified in the script.
- Syntax Checking : It verifies that the script adheres to the correct syntax of
Pig Latin.
- Directed Acyclic Graph (DAG) Generation : The output of the parser is a
Directed Acyclic Graph (DAG). In this graph, nodes represent logical operators
(like join, filter, etc.), and edges represent the data flows between these
operations. The acyclic nature ensures that there are no cycles in the data flow,
meaning that data flows in one direction without looping back.
5. Data Processing : The execution engine processes the data according to the
operations defined in the Pig Latin script. It reads input data from HDFS,
performs the specified transformations, and writes the output back to HDFS.
7. User Defined Functions (UDFs) : If there are specific functions that are not
available in the built-in Pig operators, users can create UDFs in other
programming languages (like Java) and embed them in their Pig Latin scripts.
1. Input Preparation : Before data can be processed by the mapper, it must be converted
into key-value pairs. This is crucial because the mapper only understands data in this
format. The transformation into key-value pairs is typically handled by a component called
the RecordReader.
2. InputSplit : This defines a logical representation of the data and breaks it into smaller,
manageable pieces for processing. Each piece is then passed to the mapper.
3. RecordReader : This component interacts with the InputSplit to convert the split data
into records formatted as key-value pairs. By default, it uses `TextInputFormat` to read the
data, ensuring that it is in a suitable format for the mapper.
4. Map Phase : During the Map phase, the mapper processes each key-value pair (k1,
v1). The key (k1) represents a unique identifier, while the value (v1) is the associated data.
The output of the map function can either be zero (if no relevant values are found) or a set
of intermediate key-value pairs (k2, v2). Here, k2 is a new key generated based on the
processing logic, and v2 contains the information needed for the subsequent Reduce
phase.
5. Reduce Phase : After the Map phase, the output key-value pairs are shuffled and
sorted. The Reduce task takes these intermediate key-value pairs (k2, v2) as input. It
groups the values associated with each key and applies a reducing function to produce a
smaller set of output key-value pairs (k3, v3). This output is then written to the final output
file.
6. Grouping and Aggregation : The grouping operation is performed during the shuffle
phase, where all pairs with the same key are collected together. This allows the reducer to
apply aggregate functions like count, sum, average, min, and max on the grouped data.
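A compact sketch of this (k1, v1) → (k2, v2) → (k3, v3) flow, using an invented "maximum temperature per year" scenario and Python's groupby to stand in for the shuffle-and-sort step:

```python
# Hedged sketch of the key-value flow: raw record -> map -> (k2, v2) -> shuffle/sort -> reduce -> (k3, v3).
from itertools import groupby
from operator import itemgetter

# Raw "year,temperature" records (illustrative data); the line offset key k1 is ignored here.
records = ["2021,31", "2021,36", "2022,29", "2022,40", "2022,33"]

def mapper(record):
    year, temp = record.split(",")
    return year, int(temp)                       # intermediate pair (k2, v2)

intermediate = sorted(map(mapper, records), key=itemgetter(0))   # shuffle and sort by key

# Reduce: aggregate the grouped values for each key, here with max().
for year, pairs in groupby(intermediate, key=itemgetter(0)):
    print(year, max(v for _, v in pairs))        # (k3, v3): 2021 36, 2022 40
```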
Module-5
Q. 09 a What is Machine Learning? Explain different types of Regression Analysis. CO5 10
Machine Learning (ML) is a subset of Artificial Intelligence (AI) that focuses on the
development of algorithms that allow computers to learn from and make predictions
based on data. It involves using statistical techniques to enable machines to improve
their performance on a specific task over time without being explicitly programmed for
each scenario. ML can be broadly categorized into supervised learning, unsupervised
learning, and reinforcement learning.
1. Simple Linear Regression : This is the most basic form of regression analysis. It
models the relationship between a single independent variable (predictor) and a
dependent variable (outcome) using a linear equation. The goal is to find the best-fitting
line through a scatter plot of data points, minimizing the deviation (error) between the observed and predicted values; a short sketch of fitting such models appears after this list.
6. Ridge and Lasso Regression : These are techniques used to prevent overfitting in
regression models. Ridge regression adds a penalty equal to the square of the magnitude
of coefficients, while Lasso regression adds a penalty equal to the absolute value of the
magnitude of coefficients.
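A minimal sketch of fitting these models, assuming scikit-learn and NumPy are available (the tiny dataset and the alpha values are invented for illustration):

```python
# Hedged sketch: simple linear regression plus Ridge and Lasso with scikit-learn.
# alpha controls the strength of the penalty term that guards against overfitting.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])      # single predictor
y = np.array([2.1, 4.2, 5.9, 8.1, 9.8])                # outcome, roughly y ≈ 2x

for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)):
    model.fit(X, y)
    print(type(model).__name__, model.coef_, model.intercept_)
```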
K-means clustering is a popular and straightforward algorithm used in data mining and
machine learning for partitioning a dataset into distinct groups, or clusters. The main
goal of K-means is to divide the data into K clusters, where each data point belongs to
the cluster with the nearest mean, serving as a prototype of the cluster.
1. Initialization : First, you need to choose the number of clusters, K. Then, K initial
centroids (the center points of the clusters) are randomly selected from the dataset.
2. Assignment Step : Each data point is assigned to the nearest centroid based on a
distance metric, typically Euclidean distance. This means that for each data point, the
algorithm calculates the distance to each centroid and assigns the point to the cluster
represented by the closest centroid.
3. Update Step : After all points have been assigned to clusters, the centroids are
recalculated. This is done by taking the mean of all the data points that belong to each
cluster. The new centroid is the average position of all the points in that cluster.
4. Repeat : Steps 2 and 3 are repeated until the centroids no longer change
significantly, indicating that the algorithm has converged, or until a predetermined
number of iterations is reached.
5. Output : The final output of the K-means algorithm is the K clusters of data points,
along with their corresponding centroids.
K-means clustering is widely used due to its simplicity and efficiency, especially for
large datasets. However, it does have some limitations, such as sensitivity to the initial
placement of centroids and the requirement to specify the number of clusters in advance.
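The steps above map directly onto scikit-learn's KMeans, as in the sketch below (scikit-learn is assumed; the points and K = 2 are illustrative):

```python
# Hedged sketch: K-means clustering following the initialization/assignment/update steps above.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1.5, 2], [1, 0.5],      # a tight group near (1, 1)
              [8, 8], [8.5, 9], [9, 8]])       # a second group near (8.5, 8.3)

# n_init restarts with different random centroids to soften the sensitivity to initialization.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(km.labels_)            # cluster index assigned to each point
print(km.cluster_centers_)   # final centroids (the mean of each cluster's points)
```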
The Naïve Bayes classifier is a supervised machine learning algorithm based on Bayes'
Theorem and is primarily used for classification tasks. It assumes that features are
independent of each other, which is why it is called naïve. Despite this simplifying
assumption, it often works surprisingly well in many practical applications.
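As a small sketch (assuming scikit-learn; the toy dataset is invented), a Gaussian Naïve Bayes classifier can be trained and used for prediction as follows:

```python
# Hedged sketch: classification with Gaussian Naive Bayes (features treated as independent).
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy data: two numeric features per sample, binary class labels.
X = np.array([[1.0, 2.1], [1.2, 1.9], [0.9, 2.0],     # class 0
              [3.8, 0.4], [4.1, 0.6], [3.9, 0.5]])    # class 1
y = np.array([0, 0, 0, 1, 1, 1])

clf = GaussianNB().fit(X, y)
print(clf.predict([[1.1, 2.0], [4.0, 0.5]]))           # -> [0 1]
print(clf.predict_proba([[1.1, 2.0]]))                 # posterior probability per class
```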
The five phases in a process pipeline for text mining are designed to efficiently analyze
and extract valuable information from unstructured text. Here’s a detailed breakdown of
each phase:
1. Text Pre-processing : This initial phase involves preparing the text for analysis. It
includes several steps such as:
- Tokenization : Breaking down the text into individual words or tokens.
- Normalization : Converting all text to a standard format, such as lowercasing.
- Removing Stop Words : Filtering out common words that may not add significant
meaning (e.g., "and," "the").
- Stemming and Lemmatization : Reducing words to their base or root form to treat
different forms of a word as the same (e.g., "running" to "run").
2. Feature Extraction : In this phase, relevant features are extracted from the pre-processed text (a short sketch of this step appears at the end of this answer). This can involve:
- Vectorization : Converting text into numerical format using techniques like Term
Frequency-Inverse Document Frequency (TF-IDF) or word embeddings.
- Identifying Key Phrases : Extracting important phrases or terms that represent the
content of the text.
4. Evaluation : After modeling, the results need to be evaluated to assess their accuracy
and effectiveness. This can involve:
- Performance Metrics : Using metrics such as precision, recall, and F1-score to
measure the model's performance.
- Cross-Validation : Testing the model on different subsets of data to ensure its
robustness.
5. Analysis of Results : The final phase focuses on interpreting the outcomes of the
text mining process. This includes:
- Visualizing Data : Creating visual representations of the results to identify patterns
and insights.
- Using Results for Decision Making : Applying the insights gained to improve
business processes, enhance marketing strategies, or inform future actions.
These phases work iteratively and interactively, allowing for continuous refinement and
improvement of the text mining process. Each phase is crucial for transforming raw text
data into actionable knowledge.
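To make the pre-processing and feature extraction phases concrete, here is a minimal sketch assuming scikit-learn; the three documents are invented for illustration:

```python
# Hedged sketch of phases 1 and 2: case normalization, stop-word removal, and TF-IDF vectorization.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Big data analytics finds patterns in large data sets",
    "Text mining extracts knowledge from unstructured text",
    "Hadoop stores big data across a cluster",
]

# lowercase=True normalizes case; stop_words='english' drops common words like "the" and "and".
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
tfidf = vectorizer.fit_transform(docs)          # documents -> numeric TF-IDF matrix

print(vectorizer.get_feature_names_out()[:5])   # a few of the extracted terms
print(tfidf.shape)                              # (3 documents, number of distinct terms)
```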
b Explain Web Usage Mining. CO5 10
Web Usage Mining is a fascinating area of data mining that focuses on discovering and
analyzing patterns from web usage data. This type of mining is particularly concerned with understanding how users interact with a website, drawing on data such as web server logs and clickstreams.
The process of Web Usage Mining can be broken down into three main phases:
1. Pre-processing : This initial phase involves converting the raw usage data collected
from various sources into a format suitable for analysis. This data often comes from web
server logs, which typically include information such as the IP address of the user, the
pages they accessed, and the time of access. The goal here is to clean and organize the
data to facilitate effective pattern discovery.
2. Pattern Discovery : In this phase, various algorithms and methods are applied to the
pre-processed data to uncover interesting patterns. Techniques from fields such as
machine learning, statistics, and information retrieval are utilized. For example, methods
like clustering, classification, and association rule mining can help identify common user
behaviors, such as frequently accessed pages or typical navigation paths.
3. Pattern Analysis : After patterns have been discovered, they are analyzed to extract
meaningful insights. This analysis can reveal trends in user behavior, such as peak usage
times, popular content, and user preferences. The insights gained can be used to inform
decisions about website design, content placement, and targeted marketing efforts.
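As a small sketch of the pre-processing phase (the log lines follow the common Apache log layout and are invented), raw server-log entries can be parsed and aggregated into page-hit counts:

```python
# Hedged sketch of pre-processing: parse web server log lines and count page hits.
# The regex targets the Apache common log format; the sample lines are illustrative.
import re
from collections import Counter

LOG_PATTERN = re.compile(r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<page>\S+)')

logs = [
    '10.0.0.1 - - [01/Jan/2024:10:00:01 +0530] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2024:10:00:05 +0530] "GET /courses.html HTTP/1.1" 200 734',
    '10.0.0.1 - - [01/Jan/2024:10:01:12 +0530] "GET /index.html HTTP/1.1" 200 512',
]

hits = Counter()
for line in logs:
    m = LOG_PATTERN.match(line)
    if m:
        hits[m.group("page")] += 1

print(hits.most_common())    # e.g. [('/index.html', 2), ('/courses.html', 1)]
```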