BDA Unit5
In data analysis, a regression model is a statistical technique used to understand the relationship between a
dependent (target) variable and one or more independent (predictor) variables. The goal is to model and
predict the target variable based on the input features.
1. Linear Regression
o Used for: Modeling a linear (straight-line) relationship between the dependent variable and one or more predictors.
o Example: Predicting house price based on its size.
2. Logistic Regression
o Assumptions: Assumes a linear relationship between the predictors and the log-odds of the
dependent variable.
o Equation: log(p / (1 − p)) = β0 + β1x1 + β2x2 + ... + βnxn, where p is the probability of the positive class.
o Example: Predicting whether a customer will buy a product (yes/no) based on age, income,
etc.
3. Polynomial Regression
o Used for: When the relationship between the variables is non-linear but can be
approximated by a polynomial function.
o Example: Modeling the growth of a plant over time where the relationship is curved rather
than straight.
4. Regularized Regression
o Lasso: Adds a penalty that can shrink coefficients to zero (L1 regularization), effectively
performing feature selection.
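A minimal sketch of fitting the regression types above with scikit-learn; the synthetic data and the alpha value are illustrative assumptions, not part of the notes.

```python
# Sketch: linear, logistic, and Lasso regression on made-up data.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                      # 100 samples, 3 predictors

# Linear regression: continuous target with a straight-line relationship
y_cont = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.1, size=100)
lin = LinearRegression().fit(X, y_cont)
print("Linear coefficients:", lin.coef_)

# Logistic regression: binary target (e.g., buy / not buy)
y_bin = (X[:, 0] + X[:, 1] > 0).astype(int)
logit = LogisticRegression().fit(X, y_bin)
print("Class probabilities for first sample:", logit.predict_proba(X[:1]))

# Lasso: the L1 penalty can shrink some coefficients exactly to zero
lasso = Lasso(alpha=0.1).fit(X, y_cont)
print("Lasso coefficients (note any zeros):", lasso.coef_)

# Polynomial regression can be built the same way by first expanding features with
# sklearn.preprocessing.PolynomialFeatures and then fitting LinearRegression.
```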
MULTIVARIATE ANALYSIS
Multivariate analysis refers to the set of statistical techniques used to analyse and interpret data that
involves more than two variables. It allows researchers and analysts to examine the relationships between
multiple variables simultaneously and make predictions or inferences based on that complex data.
Purpose of Multivariate Analysis
• Dimensionality Reduction (Principal Component Analysis): Reduces the number of variables by transforming the data into a smaller set of uncorrelated variables (principal components). Example: reducing hundreds of features in an image to a smaller number without losing key information.
• Clustering: Groups data into clusters based on similarities among multiple variables. Example: segmentation of customers into different groups based on spending behavior and demographics.
• Improves Predictions: Helps create more accurate models by considering multiple factors.
• Better Insight: Provides deeper insights into the interrelationships between variables.
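A brief sketch, assuming scikit-learn and synthetic data, of the two multivariate techniques mentioned above (PCA and clustering); the component and cluster counts are arbitrary choices for illustration.

```python
# PCA for dimensionality reduction, k-means for clustering.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))            # 200 observations, 10 variables

# Reduce 10 variables to 2 uncorrelated principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print("Explained variance ratio:", pca.explained_variance_ratio_)

# Group observations into 3 clusters (e.g., customer segments)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_reduced)
print("Cluster labels of first 10 observations:", kmeans.labels_[:10])
```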
Bayesian Analysis
Bayesian Analysis is a statistical approach that uses Bayes' Theorem to update the probability of a
hypothesis as new evidence becomes available. Unlike traditional frequentist statistics, which treats
parameters as fixed unknown quantities estimated from the data alone, Bayesian analysis treats parameters
as random variables and incorporates prior knowledge or beliefs into the analysis.
Key Concepts in Bayesian Analysis:
1. Prior Distribution:
The initial belief about the parameters or hypothesis before data is collected. The choice of prior can
significantly affect the results.
2. Likelihood Function:
Represents the probability of the observed data given a set of parameters.
3. Posterior Distribution:
The updated belief after considering the data, combining the prior and likelihood.
4. Bayesian Inference:
The process of updating beliefs (posterior) with new data using Bayes' Theorem.
5. Credible Interval:
A Bayesian alternative to confidence intervals. It provides a range of values within which a parameter lies
with a certain probability.
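These pieces fit together through Bayes' Theorem; for a parameter θ and observed data D:

```latex
P(\theta \mid D) = \frac{P(D \mid \theta)\,P(\theta)}{P(D)}
\qquad\Longrightarrow\qquad
\text{posterior} \;\propto\; \text{likelihood} \times \text{prior}
```

The denominator P(D), the marginal likelihood, only normalizes the posterior, which is why the posterior is often written as proportional to likelihood × prior.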
Inference in Bayesian analysis refers to the process of drawing conclusions from the posterior distribution after
incorporating observed data and prior beliefs. It is about updating our knowledge or belief about a parameter or
hypothesis based on new evidence.
Bayesian Inference:
Bayesian inference works by using Bayes' Theorem to update the probability distribution of the parameters as new
data becomes available. It involves:
1. Prior Distribution: The initial belief about the parameters before seeing the data.
2. Likelihood: The probability of the observed data given the parameters.
3. Posterior Distribution: The updated belief after observing the data, which is proportional to the product of
the prior and likelihood.
Steps in Bayesian Inference:
1. Define the Prior: Choose a prior distribution based on your beliefs or prior knowledge about the
parameter(s).
2. Collect the Data: Gather the observed data relevant to the parameter(s).
3. Model the Likelihood: Determine the likelihood function that describes the probability of the data given the
parameter(s).
4. Update the Posterior: Use Bayes' Theorem to calculate the posterior distribution, which combines the prior
and likelihood.
5. Make Inferences: Use the posterior distribution to draw conclusions, such as estimating parameters,
calculating credible intervals, or making predictions.
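A minimal sketch of these steps for a conjugate Beta-Binomial model, assuming SciPy; the prior parameters and the success counts are made-up numbers for illustration.

```python
# Bayesian inference steps with a Beta prior and Binomial likelihood.
from scipy import stats

# 1. Define the prior: Beta(2, 2) encodes a mild belief that p is near 0.5
prior_a, prior_b = 2, 2

# 2. Collect the data: say 30 successes out of 100 trials (hypothetical)
successes, trials = 30, 100

# 3. Model the likelihood: Binomial; with a Beta prior the posterior is again Beta
# 4. Update the posterior using the conjugate update rule
posterior = stats.beta(prior_a + successes, prior_b + (trials - successes))

# 5. Make inferences from the posterior distribution
print("Posterior mean:", posterior.mean())
print("95% credible interval:", posterior.interval(0.95))
```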
Bayesian Networks
A Bayesian Network (also known as a Bayes Network or Belief Network) is a graphical model that represents a set of
variables and their conditional dependencies using a directed acyclic graph (DAG). Each node in the graph represents
a random variable, and the edges (arrows) represent conditional dependencies between them. Bayesian networks
are particularly useful for modeling complex systems with multiple interacting variables.
1. Nodes: Represent random variables, one node per variable in the model.
2. Edges: Represent dependencies between variables. An edge from node A to node B means that A has a direct
influence on B.
3. Conditional Probability Distributions (CPDs): Each node has a conditional probability distribution that
specifies the probability of that variable given its parents (the nodes pointing to it). If a node has no parents,
the distribution is the prior probability of that node.
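A small sketch of how a Bayesian network's CPDs combine into a joint distribution, using the common Rain/Sprinkler/GrassWet illustration; all probability values below are assumptions made up for the example.

```python
# Tiny Bayesian network: Rain -> Sprinkler, and (Rain, Sprinkler) -> GrassWet.
# Each node stores P(node | parents); a node with no parents stores its prior.

p_rain = {True: 0.2, False: 0.8}                     # prior for Rain
p_sprinkler = {True: 0.01, False: 0.40}              # P(Sprinkler=True | Rain)
p_wet = {(True, True): 0.99, (True, False): 0.80,    # P(GrassWet=True | Rain, Sprinkler)
         (False, True): 0.90, (False, False): 0.05}

def joint(rain, sprinkler, wet):
    """Joint probability = product of each node's CPD given its parents."""
    p_s = p_sprinkler[rain] if sprinkler else 1 - p_sprinkler[rain]
    p_w = p_wet[(rain, sprinkler)] if wet else 1 - p_wet[(rain, sprinkler)]
    return p_rain[rain] * p_s * p_w

# Marginal P(GrassWet = True): sum the joint over the unobserved variables
p_wet_true = sum(joint(r, s, True) for r in (True, False) for s in (True, False))
print("P(GrassWet = True) =", round(p_wet_true, 4))
```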
Support Vectors and Kernel Methods in SVM
Support Vector Machine (SVM) is a powerful machine learning algorithm used for classification and regression tasks.
Two key concepts that are critical to understanding SVM are support vectors and kernel methods. Here's a
breakdown of each concept:
Support Vectors
In SVM, support vectors are the data points that are closest to the decision boundary (hyperplane). These points
are critical because they are the ones that define the margin between classes. The margin is the distance between
the decision boundary and the closest points of each class, and SVM aims to maximize this margin to achieve the
best classification performance.
• Support Vectors are the data points that lie on the edge of the margin.
• The margin is maximized by finding the optimal hyperplane that separates the classes.
• Only the support vectors influence the position and orientation of the decision boundary. Non-support
vectors can be removed without affecting the model.
• SVM relies on the support vectors to create a robust classifier that generalizes well on unseen data.
Example:
Imagine you have a 2D dataset where you want to classify data into two classes: A and B. The SVM algorithm will find
the hyperplane (line) that separates the two classes with the maximum margin. The data points that are closest to
this line (the support vectors) are crucial to defining this optimal separation.
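A short sketch of this 2D example with scikit-learn's SVC; the six data points are made up to form two separable groups.

```python
# Fit a linear SVM on a toy 2D dataset and inspect which points become support vectors.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]], dtype=float)  # class A, then class B
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Only these points define the maximum-margin hyperplane; the remaining points
# could be removed without changing the decision boundary.
print("Support vectors:\n", clf.support_vectors_)
print("Support vectors per class:", clf.n_support_)
```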
Kernel Methods
While SVM works well for linearly separable data, many real-world problems involve data that is non-linearly
separable. This means that a straight line or hyperplane cannot easily separate the data points of different classes. To
address this challenge, SVM uses kernel methods to transform the data into a higher-dimensional space where a
linear separation is possible.
What is a Kernel?
A kernel is a mathematical function that computes the inner product (similarity) between two data points in a
higher-dimensional feature space without explicitly transforming the data points. The kernel trick allows SVM to
perform efficiently in high-dimensional spaces without having to compute the transformation explicitly.
• Linear separability: SVM with a kernel can find a linear separator in a higher-dimensional space even if the
data is non-linearly separable in the original space.
• Computational efficiency: Instead of explicitly transforming the data into higher dimensions, kernel functions
compute the dot product in the transformed space directly, saving computation time and resources.
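A sketch of the kernel idea, assuming scikit-learn: the RBF (Gaussian) kernel is evaluated directly in the original space, and an RBF-kernel SVC separates a toy XOR pattern that no straight line can split; the gamma and C values are arbitrary.

```python
# A kernel scores similarity in an implicit high-dimensional space
# without ever constructing that space explicitly (the "kernel trick").
import numpy as np
from sklearn.svm import SVC

def rbf_kernel(x1, x2, gamma=0.5):
    """K(x1, x2) = exp(-gamma * ||x1 - x2||^2), computed in the input space."""
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

a, b = np.array([1.0, 2.0]), np.array([2.0, 0.0])
print("RBF similarity:", rbf_kernel(a, b))

# XOR-style data: not linearly separable in 2D, but separable with an RBF kernel
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])
clf = SVC(kernel="rbf", gamma=2.0, C=10.0).fit(X, y)
print("Predictions on the training points:", clf.predict(X))
```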
Time Series Analysis: Overview
Time series analysis involves examining data points collected or recorded at successive time intervals. The goal is to
understand the underlying structure and function of the data to make forecasts, detect patterns, or gain insights.
Components of a Time Series
1. Trend
Long-term movement in the data (upward or downward).
E.g., increase in global temperature over decades.
2. Seasonality
Regular pattern repeating over a fixed period (like months or quarters).
E.g., increase in ice cream sales during summer.
3. Cyclic Patterns
Fluctuations that are not of fixed period, often influenced by economic or business cycles.
E.g., stock market cycles.
4. Irregular/Random Component
Unpredictable or residual variation in the data.
E.g., impact of a natural disaster on sales.
Common Techniques in Time Series Analysis
1. Decomposition
Breaks time series into trend, seasonality, and residuals.
2. Smoothing Methods
o Moving Average
Averages data over a window to reduce noise.
o Exponential Smoothing
Assigns more weight to recent observations.
3. Stationarity Check
Time series must be stationary (constant mean & variance) for many models.
Use tests like ADF (Augmented Dickey-Fuller).
4. Differencing
Used to make a non-stationary time series stationary.
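A compact sketch of these techniques with pandas and statsmodels on a synthetic monthly series; the trend, seasonality, and noise levels are invented for illustration.

```python
# Decomposition, moving-average smoothing, ADF stationarity test, and differencing.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller

# Synthetic monthly data: upward trend + yearly seasonality + noise
idx = pd.date_range("2018-01-01", periods=72, freq="MS")
t = np.arange(72)
y = pd.Series(0.5 * t + 10 * np.sin(2 * np.pi * t / 12)
              + np.random.default_rng(0).normal(0, 1, 72), index=idx)

# 1. Decomposition into trend, seasonal, and residual components
decomposition = seasonal_decompose(y, model="additive", period=12)

# 2. Smoothing with a 12-month moving average
smoothed = y.rolling(window=12).mean()

# 3. Stationarity check with the Augmented Dickey-Fuller test
print("ADF p-value before differencing:", round(adfuller(y)[1], 3))

# 4. Differencing to remove the trend and move toward stationarity
y_diff = y.diff().dropna()
print("ADF p-value after differencing:", round(adfuller(y_diff)[1], 3))
```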
Linear System Analysis
Linear system analysis deals with studying systems whose behavior can be described using linear equations. It's
commonly applied in engineering, control systems, signal processing, and mathematics.
1. Linearity
The system satisfies the superposition principle: scaling an input scales the output, and the response to a sum of inputs is the sum of the individual responses (see the block below). This property is what makes linear analysis tractable in applications such as:
• Communication systems
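A compact statement of the superposition (linearity) condition, for a system T acting on inputs x1(t), x2(t) and scalars a, b:

```latex
T\{a\,x_1(t) + b\,x_2(t)\} \;=\; a\,T\{x_1(t)\} + b\,T\{x_2(t)\}
```

Additivity (responses to summed inputs add) and homogeneity (scaling the input scales the output) together make up this condition.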