DM assignment 2
LOVE
Btech CE Hindi 6th Sem
22001050501
Q1. Write an algorithm for k-nearest neighbor classification given k, the
number of nearest neighbors, and n, the number of attributes describing
each tuple.
1. Input the training set D, the number of neighbors k, and the new tuple X_new described by n attributes.
2. For each training tuple X_i in D, compute the distance between X_i and X_new over all n attributes (e.g., Euclidean distance).
3. Sort the distances in ascending order.
4. Take the first k tuples from the sorted list. These are the k nearest neighbors of X_new.
5. Collect the class labels of these k nearest neighbors.
6. Assign the most frequent class label as the predicted class for X_new.
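A minimal Python sketch of this procedure (Euclidean distance and simple majority voting are assumptions; any other distance measure or tie-breaking rule could be substituted):

```python
import math
from collections import Counter

def knn_classify(training_data, x_new, k):
    """training_data: list of (attributes, label) pairs, where attributes
    is a sequence of n numeric values; x_new: sequence of n values."""
    # Step 2: compute the distance from every training tuple to x_new
    distances = []
    for attributes, label in training_data:
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(attributes, x_new)))
        distances.append((d, label))
    # Step 3: sort the distances in ascending order
    distances.sort(key=lambda pair: pair[0])
    # Steps 4-5: take the first k tuples and collect their class labels
    neighbor_labels = [label for _, label in distances[:k]]
    # Step 6: assign the most frequent class label
    return Counter(neighbor_labels).most_common(1)[0][0]

# Example usage with made-up data
data = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((6.0, 6.5), "B"), ((5.8, 6.1), "B")]
print(knn_classify(data, (1.1, 0.9), k=3))   # -> "A"
```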
Basic Concepts:
1. Bayes' Theorem: Bayes' theorem describes the probability of a
hypothesis H given some evidence E. Mathematically, it is represented
as:
P(H|E) = P(E|H) · P(H) / P(E)
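A small worked example of the formula (the numbers are illustrative assumptions, not taken from the assignment):

```python
# Bayes' theorem: P(H|E) = P(E|H) * P(H) / P(E)
# Illustrative numbers chosen only for this example
p_h = 0.01          # prior probability of the hypothesis H
p_e_given_h = 0.95  # likelihood of the evidence E given H
p_e = 0.06          # overall probability of the evidence E

p_h_given_e = p_e_given_h * p_h / p_e
print(round(p_h_given_e, 4))   # 0.1583 -> posterior probability P(H|E)
```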
Basic Architecture:
1. Input Layer:
The input layer consists of neurons (nodes) corresponding to
the input features of the dataset. Each neuron represents one
feature.
Input neurons simply transmit the input values to the neurons
in the hidden layers.
2. Hidden Layers:
Hidden layers are intermediate layers between the input and
output layers.
Each neuron in a hidden layer receives inputs from all neurons
in the previous layer (or the input layer) and computes a
weighted sum of these inputs.
The weighted sum is then passed through an activation
function to introduce non-linearity into the network.
3. Output Layer:
The output layer consists of neurons responsible for producing
the network's output.
The number of neurons in the output layer depends on the
task:
For binary classification, there is usually one output
neuron with a sigmoid activation function.
For multi-class classification, there are multiple output
neurons, each representing a class, with softmax
activation.
For regression tasks, the output layer may consist of a
single neuron or multiple neurons depending on the
number of output variables.
4. Connections:
Each neuron in a layer is connected to every neuron in the
subsequent layer.
Connections have associated weights that represent the
strength of the connection. These weights are adjusted during
training to minimize the error between the predicted output
and the actual output.
Training Process:
1. Forward Propagation:
During forward propagation, input data is fed into the
network, and activations are computed layer by layer until the
output layer is reached.
Neurons in each layer compute a weighted sum of their
inputs, apply an activation function, and pass the result to the
neurons in the next layer.
2. Loss Calculation:
The output of the network is compared to the true labels using
a loss function, which measures the difference between the
predicted and actual outputs.
Common loss functions include mean squared error (MSE) for
regression and cross-entropy loss for classification.
3. Backpropagation:
Backpropagation is used to update the weights of the network
in order to minimize the loss.
It involves calculating the gradient of the loss function with
respect to the network's weights using the chain rule of
calculus.
The weights are then updated in the opposite direction of the
gradient using an optimization algorithm such as gradient
descent.
4. Iterations:
The forward propagation, loss calculation, and
backpropagation steps are repeated for multiple iterations
(epochs) or until convergence.
During each iteration, the network learns to better
approximate the mapping between inputs and outputs by
adjusting its weights.
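A minimal NumPy sketch of this training loop for a one-hidden-layer network (the XOR data, layer sizes, learning rate, and sigmoid activations are assumptions chosen only to keep the example short; constant factors from the mean in the loss are folded into the learning rate):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: the XOR problem (an assumption made for illustration)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))   # input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))   # hidden -> output
lr = 0.5

for epoch in range(10000):
    # Forward propagation: weighted sums + activations, layer by layer
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Loss calculation: mean squared error between prediction and target
    loss = np.mean((out - y) ** 2)

    # Backpropagation: gradients of the loss w.r.t. each weight (chain rule)
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)

    # Gradient descent: move each weight opposite to its gradient
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0, keepdims=True)

print(round(float(loss), 4), out.round(3).flatten())  # predictions should approach [0, 1, 1, 0]
```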
Activation Functions:
Activation functions introduce non-linearity into the network,
allowing it to learn complex patterns and relationships in the data.
Common activation functions include:
Sigmoid: Maps input values to the range (0, 1). Often
used in the output layer for binary classification.
Hyperbolic Tangent (Tanh): Similar to the sigmoid function
but maps input values to the range (−1, 1).
ReLU (Rectified Linear Unit): Returns 0 for negative
inputs and the input value for positive inputs. ReLU is widely
used in hidden layers due to its simplicity and effectiveness.
Softmax: Used in the output layer for multi-class
classification tasks to produce probability distributions over
multiple classes.
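These functions can be written in a few lines of NumPy using their standard definitions (a sketch):

```python
import numpy as np

def sigmoid(z):               # maps to (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):                  # maps to (-1, 1)
    return np.tanh(z)

def relu(z):                  # 0 for negative inputs, identity for positive
    return np.maximum(0.0, z)

def softmax(z):               # probability distribution over classes
    e = np.exp(z - np.max(z))            # shift for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # three class probabilities summing to 1
```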
Advantages:
Multilayer feed-forward neural networks can learn complex non-
linear relationships in data.
They are flexible and can be adapted to various types of tasks
including classification, regression, and pattern recognition.
With appropriate architectures and training strategies, MLPs can
achieve state-of-the-art performance on many tasks.
Limitations:
MLPs require a large amount of training data to generalize well and
avoid overfitting.
They can be computationally expensive to train, especially for large
networks and datasets.
MLPs are sensitive to the choice of hyperparameters such as the
number of hidden layers, the number of neurons per layer, and the
learning rate.
1. Sampling:
Sampling involves selecting a subset of data points from the stream
for analysis.
Random, systematic, or stratified sampling techniques can be
applied to ensure that the sampled data is representative of the
entire stream.
Sampling reduces the computational overhead of processing the
entire stream while preserving important characteristics of the data.
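A common way to draw a uniform random sample from a stream of unknown length is reservoir sampling; a minimal sketch (the reservoir size is an arbitrary assumption):

```python
import random

def reservoir_sample(stream, sample_size):
    """Keep a uniform random sample of `sample_size` items from a stream
    of unknown length, using O(sample_size) memory."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < sample_size:
            reservoir.append(item)       # fill the reservoir first
        else:
            j = random.randint(0, i)     # replace with decreasing probability
            if j < sample_size:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), 10))  # 10 items, each kept with equal probability
```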
2. Windowing:
Windowing divides the stream into fixed-size or variable-size
windows, each containing a subset of recent data points.
Common types of windows include sliding windows, tumbling
windows, and landmark windows.
Windowing allows for localized analysis over a finite window of data,
enabling the detection of temporal patterns and trends.
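A sliding window over the most recent data points can be kept with a bounded deque (the window size of 100 and the moving-average analysis are assumptions):

```python
from collections import deque

WINDOW_SIZE = 100                        # assumed window size
window = deque(maxlen=WINDOW_SIZE)       # oldest points fall out automatically

def process(point):
    window.append(point)
    # localized analysis over the current window, e.g. a moving average
    return sum(window) / len(window)

for t, value in enumerate([3, 5, 4, 6, 8]):
    print(t, process(value))
```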
3. Approximate Algorithms:
Approximate algorithms provide fast and memory-efficient solutions
for processing stream data while maintaining reasonable accuracy.
Examples include Count-Min Sketch for approximate counting,
Bloom filters for approximate set membership, and reservoir
sampling for approximate sampling.
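A minimal Count-Min Sketch for approximate counting (the table dimensions and the hashing scheme are simplified assumptions; estimates can overcount due to collisions but never undercount):

```python
import hashlib

class CountMinSketch:
    def __init__(self, width=1000, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _hash(self, item, row):
        digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item):
        for row in range(self.depth):
            self.table[row][self._hash(item, row)] += 1

    def estimate(self, item):
        # the minimum over rows limits the effect of hash collisions
        return min(self.table[row][self._hash(item, row)] for row in range(self.depth))

cms = CountMinSketch()
for word in ["a", "b", "a", "c", "a"]:
    cms.add(word)
print(cms.estimate("a"))   # 3 (never less than the true count)
```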
4. Summary Statistics:
Summary statistics involve computing compact representations of
the data stream that capture its essential characteristics.
Examples of summary statistics include mean, variance, count,
quantiles, histograms, and sketches.
Summary statistics enable efficient monitoring and analysis of
stream data without storing the entire stream.
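Running statistics such as the mean and variance can be maintained in constant memory with Welford's online algorithm (a sketch):

```python
class RunningStats:
    """Maintains count, mean, and variance of a stream in O(1) memory."""
    def __init__(self):
        self.count, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / self.count if self.count > 1 else 0.0

stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)
print(stats.count, stats.mean, stats.variance)   # 8 5.0 4.0
```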
5. Online Learning:
Online learning algorithms update their models continuously as new
data arrives, adapting to changes in the data distribution over time.
Online learning algorithms are well-suited for applications where the
data distribution is non-stationary or evolves over time.
Examples include online gradient descent, stochastic gradient
descent, and online clustering algorithms.
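A minimal sketch of online learning with stochastic gradient descent for a linear model, where the weights are updated from each arriving example (the learning rate, feature count, and simulated stream are assumptions):

```python
import numpy as np

class OnlineLinearRegressor:
    """Updates a linear model with one SGD step per arriving example."""
    def __init__(self, n_features, lr=0.01):
        self.w = np.zeros(n_features)
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        return float(self.w @ x + self.b)

    def learn_one(self, x, y):
        error = self.predict(x) - y      # gradient of squared error (up to a factor of 2)
        self.w -= self.lr * error * x
        self.b -= self.lr * error

model = OnlineLinearRegressor(n_features=2, lr=0.05)
rng = np.random.default_rng(1)
for _ in range(2000):                    # simulated stream: y = 3*x0 - 2*x1 + 1
    x = rng.normal(size=2)
    y = 3 * x[0] - 2 * x[1] + 1
    model.learn_one(x, y)
print(model.w.round(2), round(model.b, 2))   # close to [3, -2] and 1
```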
6. Incremental Processing:
Incremental processing involves updating aggregate computations
incrementally as new data arrives, rather than recomputing them
from scratch.
Incremental processing reduces the computational cost of
processing stream data by avoiding redundant computations.
Techniques such as incremental aggregation, delta processing, and
lazy evaluation are used for incremental processing.
7. Stream Clustering:
Stream clustering algorithms partition the data stream into clusters
or groups based on similarity or proximity between data points.
Clustering algorithms must be capable of handling data points that
arrive and depart from clusters dynamically over time.
Examples include BIRCH (Balanced Iterative Reducing and
Clustering using Hierarchies) and CluStream.
8. Online Anomaly Detection:
Online anomaly detection algorithms identify abnormal or unusual
patterns in the data stream in real-time.
Anomaly detection is crucial for detecting fraudulent activities,
network intrusions, and equipment failures as they occur.
Techniques include statistical methods, machine learning
algorithms, and unsupervised anomaly detection approaches.
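A simple online anomaly detector can flag points whose z-score against the running statistics exceeds a threshold (a sketch using one of the statistical methods mentioned above; the threshold of 3 is an assumption, and real systems often use more robust techniques):

```python
class ZScoreDetector:
    """Flags a point as anomalous if it lies more than `threshold`
    standard deviations from the running mean of the points seen so far."""
    def __init__(self, threshold=3.0):
        self.threshold = threshold
        self.count, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        # Score against the statistics of the points seen before x
        std = (self.m2 / self.count) ** 0.5 if self.count > 1 else 0.0
        is_anomaly = std > 0 and abs(x - self.mean) / std > self.threshold
        # Welford update of the running mean and variance
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)
        return is_anomaly

detector = ZScoreDetector()
stream = [10, 11, 9, 10, 12, 10, 11, 9, 50, 10]
print([x for x in stream if detector.update(x)])   # the outlier 50 is flagged
```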
Key Components:
1. Time-Series Data:
Time-series data consists of sequences of observations or
measurements collected over successive time intervals.
Each data point in the time series is associated with a
timestamp, indicating its position in time.
2. Similarity Metric:
A similarity metric (or distance measure) quantifies the
similarity between two time-series sequences.
Common similarity metrics for time-series data include
Euclidean distance, Dynamic Time Warping (DTW), Edit
distance, Pearson correlation coefficient, and Cosine similarity.
The choice of similarity metric depends on the characteristics
of the data and the specific requirements of the application.
3. Query Sequence:
The query sequence is the time-series sequence for which
similar sequences are sought.
It could be a complete time series or a subsequence of a
longer time series.
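A minimal sketch of similarity search over these components, sliding a query sequence along a longer series and returning the best-matching subsequence under Euclidean distance (z-normalization, DTW, and indexing are omitted for brevity):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def best_match(series, query):
    """Return (start_index, distance) of the subsequence of `series`
    most similar to `query` under Euclidean distance."""
    m = len(query)
    best = (None, float("inf"))
    for start in range(len(series) - m + 1):
        d = euclidean(series[start:start + m], query)
        if d < best[1]:
            best = (start, d)
    return best

series = [1, 2, 3, 10, 11, 12, 3, 2, 1, 10, 12, 11]
query = [10, 11, 12]
print(best_match(series, query))   # (3, 0.0) -> exact match at index 3
```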
Applications:
1. Pattern Recognition:
Similarity search is used to identify recurring patterns or
motifs in time-series data, which can provide insights into
underlying processes or behaviors.
2. Anomaly Detection:
Anomaly detection systems leverage similarity search to
identify data sequences that deviate significantly from the
norm or expected behavior.
3. Time-Series Forecasting:
Similarity search can be used to retrieve historical data
sequences that are similar to the current or predicted
sequence, providing valuable insights for forecasting tasks.
4. Content-Based Retrieval:
Content-based retrieval systems use similarity search to
retrieve time-series data sequences that are similar to a given
query sequence, enabling efficient data retrieval and
exploration.
1. Online Learning:
Online learning algorithms update the classification model
continuously as new data instances arrive, adapting to changes in
the data distribution over time.
Online learning algorithms are well-suited for dynamic data streams
as they can handle concept drift and concept evolution.
Examples include online variants of decision trees, ensemble
methods (e.g., Online Bagging, Online Boosting), and neural
networks (e.g., Online Perceptron, Online Passive-Aggressive
algorithms).
2. Ensemble Methods:
Ensemble methods combine multiple base classifiers to improve
classification performance and robustness in dynamic data stream
settings.
Ensemble methods can handle concept drift by diversifying the base
classifiers and dynamically adjusting their weights over time.
Examples include Online Bagging, Online Boosting, and dynamic
ensemble selection techniques.
3. Window-Based Classification:
Window-based classification divides the data stream into fixed-size
or variable-size windows and performs classification within each
window.
Window-based classification methods are useful for handling
concept drift and temporal dynamics in the data stream.
Techniques include applying traditional batch classification
algorithms within each window or using ensemble methods to
combine predictions from multiple windows.
Distributed Warehousing:
1. Cloud-Based Data Warehousing:
Organizations are increasingly adopting cloud-based data
warehousing solutions such as Amazon Redshift, Google
BigQuery, and Snowflake. These platforms offer scalability,
flexibility, and ease of management, allowing organizations to
store and analyze large volumes of data without upfront
infrastructure investments.
2. Serverless Data Warehousing:
Serverless computing models, such as AWS Lambda and
Google Cloud Functions, are gaining popularity for data
warehousing tasks. These platforms abstract away
infrastructure management, allowing organizations to focus on
data analysis and application development without worrying
about provisioning and scaling resources.
3. Data Lakehouse Architecture:
The data lakehouse architecture combines the benefits of data
lakes (scalability, flexibility) with those of data warehouses
(structured querying, ACID transactions) in a single unified
platform. Technologies such as Delta Lake (open-sourced by
Databricks) enable organizations to build data lakehouse
solutions for analytics and data processing.
4. Edge Computing for Warehousing:
Edge computing brings compute resources closer to the data
source, enabling real-time analytics and insights at the edge
of the network. Organizations are exploring edge computing
solutions for warehousing tasks in scenarios where low latency
and data locality are critical, such as IoT, edge analytics, and
autonomous vehicles.
5. Data Mesh:
The data mesh paradigm emphasizes decentralized data
ownership and governance, where data is treated as a product
and managed by cross-functional teams. This approach aims
to overcome the limitations of centralized data warehouses by
distributing data ownership and management responsibilities
across the organization.
Data Mining:
1. Deep Learning and Neural Networks:
Deep learning techniques, including convolutional neural
networks (CNNs), recurrent neural networks (RNNs), and
transformer models, are increasingly being applied to data
mining tasks such as image recognition, natural language
processing, and time-series analysis. These models offer
state-of-the-art performance on complex data mining tasks
but require large amounts of labeled data and computational
resources.
2. Explainable AI (XAI):
Explainable AI techniques aim to enhance the transparency
and interpretability of data mining models, especially in
critical applications such as healthcare, finance, and criminal
justice. Researchers are developing methods to explain and
visualize the decision-making processes of complex machine
learning models, enabling stakeholders to understand and
trust the results.
3. Federated Learning:
Federated learning enables model training across distributed
data sources while preserving data privacy and security.
Organizations are exploring federated learning approaches for
data mining tasks in scenarios where data cannot be
centralized due to privacy regulations, data sovereignty
concerns, or proprietary data ownership.
4. Graph Mining:
Graph mining techniques focus on analyzing and extracting
insights from interconnected data structures such as social
networks, knowledge graphs, and biological networks.
Applications of graph mining include community detection,
link prediction, and recommendation systems.
5. Streaming Data Mining:
Streaming data mining involves analyzing continuous, high-
velocity data streams in real-time to extract valuable insights
and patterns. Techniques such as online learning,
approximate algorithms, and window-based processing are
used to perform data mining tasks on streaming data.
6. Ethical Data Mining:
With growing concerns about data privacy, bias, and fairness,
ethical considerations in data mining are becoming
increasingly important. Organizations are implementing
ethical guidelines, privacy-preserving techniques, and
fairness-aware algorithms to ensure responsible and ethical
data mining practices.
7. Automated Machine Learning (AutoML):
AutoML platforms automate the end-to-end process of model
selection, hyperparameter tuning, and model deployment,
making data mining more accessible to non-experts. These
platforms leverage techniques such as hyperparameter
optimization, neural architecture search, and model
ensembling to automatically generate high-performing
machine learning models.
Q10. What do you mean by class imbalance problem? What are the
different ways to resolve it?
Class imbalance problem refers to a situation in classification tasks where
the distribution of class labels in the training data is skewed, with one
class (the minority class) being significantly underrepresented compared
to the other class(es) (the majority class or classes). This imbalance can
lead to biased models that prioritize the majority class and perform poorly
in correctly predicting instances of the minority class. Class imbalance is
common in real-world datasets, especially in domains such as fraud
detection, medical diagnosis, and anomaly detection.
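Common ways to resolve class imbalance include resampling the training data (random oversampling of the minority class, random undersampling of the majority class, or synthetic oversampling such as SMOTE), cost-sensitive learning with class weights, adjusting the decision threshold, ensemble methods such as balanced bagging or boosting, and evaluating with imbalance-robust metrics (precision, recall, F1-score, AUC) rather than accuracy. A minimal sketch of random oversampling with NumPy (the binary-label convention and data shapes are assumptions):

```python
import numpy as np

def random_oversample(X, y):
    """Duplicate minority-class rows (with replacement) until the classes
    are balanced. X: (n_samples, n_features) array, y: binary label array."""
    rng = np.random.default_rng(0)
    labels, counts = np.unique(y, return_counts=True)
    minority = labels[np.argmin(counts)]
    deficit = counts.max() - counts.min()
    minority_idx = np.flatnonzero(y == minority)
    extra = rng.choice(minority_idx, size=deficit, replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])

X = np.array([[0.1], [0.2], [0.3], [0.4], [0.9]])   # 4 majority, 1 minority
y = np.array([0, 0, 0, 0, 1])
X_bal, y_bal = random_oversample(X, y)
print(np.bincount(y_bal))   # [4 4] -> balanced classes
```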
Graph Representation:
1. Directed and Undirected Graphs:
In directed graphs, edges have a direction indicating the flow
or relationship between nodes. In undirected graphs, edges
have no direction and represent symmetric relationships
between nodes.
2. Weighted Graphs:
Graphs can have weighted edges, where each edge has an
associated weight or value representing the strength or
importance of the relationship between nodes.
3. Attributed Graphs:
Attributed graphs contain additional information associated
with nodes and/or edges, such as node attributes (e.g., labels,
categories) and edge attributes (e.g., timestamps, weights).
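A small sketch of how a directed, weighted, attributed graph might be represented with plain Python dictionaries (the node and edge attributes are illustrative assumptions):

```python
# Adjacency-list style representation of a directed, weighted, attributed graph
graph = {
    "nodes": {
        "alice": {"label": "user", "age": 30},
        "bob":   {"label": "user", "age": 25},
        "post1": {"label": "post"},
    },
    "edges": {
        # (source, target) -> edge attributes such as weight, type, timestamp
        ("alice", "bob"):   {"weight": 0.8, "type": "follows"},
        ("alice", "post1"): {"weight": 1.0, "type": "authored", "timestamp": 1700000000},
        ("bob", "post1"):   {"weight": 0.3, "type": "liked"},
    },
}

def out_neighbors(g, node):
    """Nodes reachable by one directed edge from `node`."""
    return [dst for (src, dst) in g["edges"] if src == node]

print(out_neighbors(graph, "alice"))   # ['bob', 'post1']
```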