DS Unit 2
Q1. Explain Descriptive Statistics with an example.
Ans:
Descriptive statistics involve using numerical techniques to summarize and analyze data,
offering insights such as patterns and trends.
For example: in a vehicle company's sales data, descriptive statistics help determine the mean, median, and mode of selling prices, or calculate total revenue from specific car models.
Purpose:
It helps understand the central tendency and dispersion of data, providing a comprehensive
overview useful for decision-making and data analysis.
1. Measures of Central Tendency – It represents the whole set of data by a single value and gives us the location of the central point. There are three main measures of central tendency:
a) Mean: The sum of all observations divided by the total number of observations, commonly called the average.
b) Median: The middle value of the sorted data set; it splits the data into two halves. If the number of elements is odd, the center element is the median; if it is even, the median is the average of the two central elements.
c) Mode: The value that occurs most frequently in the data set.
2. Measures of Variability – Measures of variability show how much the data points differ
from the average (mean). It indicates the spread, dispersion, or diversity of data. A high
variability means data points are widely spread, while low variability suggests they are close
to the mean.
3. Measures of Frequency Distribution – Frequency distribution shows how often each
unique value appears in a dataset. It helps to identify patterns, spot trends, and understand
the distribution of data points.
a) Frequency Count
• Shows the number of times each unique value occurs in the dataset.
b) Relative Frequency
• Shows the proportion of times a value occurs relative to the total number of observations.
c) Cumulative Frequency
• Shows the running total of frequencies up to and including each value.
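To make these measures concrete, here is a minimal sketch using Python's built-in statistics module on a small, made-up list of car selling prices (the numbers are hypothetical):

# A minimal sketch of central tendency, variability, and frequency
# distribution; the price values are made up for illustration.
import statistics
from collections import Counter

prices = [12000, 15000, 15000, 18000, 22000, 15000, 30000]

print("Mean:", statistics.mean(prices))      # central tendency
print("Median:", statistics.median(prices))
print("Mode:", statistics.mode(prices))
print("Std dev:", statistics.stdev(prices))  # variability (spread)

# Frequency, relative frequency, and cumulative frequency
counts = Counter(prices)
total, cumulative = len(prices), 0
for value in sorted(counts):
    cumulative += counts[value]
    print(value, counts[value], counts[value] / total, cumulative)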
Q2. Explain Machine Learning.
Ans:
Machine Learning
• The term "machine learning" was coined by Arthur Samuel in 1959 at IBM, who defined it as "the field of study that gives computers the ability to learn without being explicitly programmed."
• Machine Learning (ML) is the process of programming computers to improve performance
based on example data or past experiences.
• It can be predictive (making future predictions) or descriptive (gaining insights from
data).
• ML is a subset of Artificial Intelligence (AI), focusing on enabling machines to make
decisions using data.
Definition of Learning in ML
A program is said to learn from experience (E) with respect to a set of tasks (T) and a performance measure (P) if its performance on the tasks in T, as measured by P, improves with experience E.
Types of Machine Learning
1. Supervised Learning
In supervised learning, the machine trains on labeled data, where each input has a known correct output.
2. Unsupervised Learning
In unsupervised learning, the machine trains on unlabeled data to discover patterns and
relationships.
Q3. Explain Supervised Learning, its steps, and its types.
Ans:
• Supervised learning is a machine learning approach where models learn from labeled
data—each input has a corresponding output. The algorithm learns to map inputs to the
correct output by minimizing prediction errors.
Example:
• In email spam detection, the model learns to classify emails based on patterns from labeled
examples.
• Predicting house prices based on features like size, location, and number of rooms.
Steps in Supervised Learning:
1. Data Acquisition: Collect relevant labeled data from various sources (databases, APIs, surveys).
2. Data Cleaning: Handle missing values, remove duplicates, and correct inconsistencies
to ensure data quality.
3. Data Splitting: Divide the dataset into training data (to train the model) and test data
(to evaluate the model).
4. Model Training and Building: Train the model using the labeled training data to learn
patterns and relationships.
5. Model Testing: Test the model's performance on unseen data (test data) to evaluate
accuracy and generalization.
6. Model Deployment: Integrate the trained model into a real-world system for making
predictions on new, live data.
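As an illustration of steps 3–5 above, here is a minimal sketch assuming scikit-learn is available; the built-in Iris dataset stands in for labeled data:

# A minimal sketch of splitting, training, and testing with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Step 3: split into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 4: train the model on labeled training data
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Step 5: evaluate on unseen test data
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))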
Types of Supervised Learning:
1. Classification
2. Regression
1. Classification: Categorizes data into predefined classes or categories. The goal is to map input variables to discrete output variables.
Examples:
• Support Vector Machine (SVM): Finds the best hyperplane that separates classes in an
N-dimensional space.
• K-Nearest Neighbours (KNN): Classifies data based on the closest neighbors in the
dataset.
• Random Forest: Combines multiple decision trees to improve accuracy and reduce
overfitting.
2. Regression: Predicts continuous output values by mapping input variables to a numeric output.
Equation: y = mx + c
Where:
• y = Dependent variable
• x = Independent variable
• m = Slope (coefficient)
• c = Intercept
Examples:
• Linear Regression: Models the linear relationship between input variables and the
output.
• Decision Tree Regression: Splits data into branches to predict continuous values.
• Support Vector Regression (SVR): Uses SVM principles to predict continuous outputs
within a certain error margin.
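A minimal linear-regression sketch, assuming scikit-learn; the data points are made up so that y = 2x + 1 is the expected fit:

# Fitting y = m*x + c on made-up data with scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])   # independent variable x
y = np.array([3, 5, 7, 9, 11])            # dependent variable y (here y = 2x + 1)

model = LinearRegression().fit(X, y)
print("Slope m:", model.coef_[0])         # ~2.0
print("Intercept c:", model.intercept_)   # ~1.0
print("Prediction for x=6:", model.predict([[6]])[0])  # ~13.0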
Q4. Explain Unsupervised Learning and its types.
Ans:
• Unsupervised learning is a type of machine learning that works with unlabeled data,
meaning the data lacks predefined labels or categories.
• Its primary goal is to uncover hidden patterns, structures, or relationships within the data
without explicit guidance.
1. Clustering
Clustering groups similar objects into the same cluster while keeping objects in different clusters distinct.
Each object is defined by features, and the process relies on measuring distances (e.g.,
Euclidean) between objects to determine similarity. This technique is used in customer
segmentation, image recognition, and anomaly detection.
Algorithms:
a) K-Means Clustering
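A minimal K-Means sketch, assuming scikit-learn; the 2-D points are made up to form two obvious clusters:

# K-Means on made-up 2-D points; distances (Euclidean) determine similarity.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [2, 3],      # cluster A
              [8, 8], [9, 10], [10, 9]])   # cluster B

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Labels:", kmeans.labels_)           # cluster index per point
print("Centroids:", kmeans.cluster_centers_)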
2. Dimensionality Reduction
Dimensionality reduction decreases the number of input features while retaining as much important information as possible, simplifying models and speeding up training.
Methods:
• Feature Selection: Chooses the most relevant existing features.
• Feature Extraction: Combines features to create new ones (e.g., using PCA).
Algorithm:
• Principal Component Analysis (PCA): Projects the data onto a smaller set of new axes (principal components) that capture the maximum variance.
Other techniques include both linear (e.g., PCA) and nonlinear methods, which are increasingly gaining popularity.
3. Association Rule Learning
An unsupervised learning method that finds relationships between variables in large datasets.
Commonly used in market basket analysis to discover patterns like "If a customer buys bread,
they might also buy butter."
Algorithms:
• Apriori: Identifies frequent item sets and builds rules based on minimum support.
• Eclat: Uses depth-first search and set intersection for faster, more efficient pattern discovery.
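To make this concrete, here is a minimal sketch of the support and confidence computations behind a "bread → butter" rule, in plain Python on made-up transactions:

# Support and confidence for the rule bread -> butter; transactions are made up.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]

n = len(transactions)
support_bread = sum("bread" in t for t in transactions) / n
support_both = sum({"bread", "butter"} <= t for t in transactions) / n
confidence = support_both / support_bread  # P(butter | bread)

print(f"support(bread)={support_bread:.2f}, "
      f"support(bread,butter)={support_both:.2f}, "
      f"confidence(bread->butter)={confidence:.2f}")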
Q5. Explain Bias, Variance, and the Bias-Variance Trade-off.
Ans:
1. Bias
1. Definition: Bias is the difference between the values predicted by a Machine Learning model and the actual values.
2. Effect: High bias leads to large errors on both training and testing data.
3. Recommendation: Models should have low bias to avoid underfitting.
2. Variance
1. Definition: Variance measures how much a model's predictions change when trained
on different datasets.
2. Effect: High variance leads to good performance on training data but high errors on
unseen data.
3. Recommendation: Models should have low variance to avoid overfitting and improve
generalization.
3. Bias-Variance Trade-off
• If the algorithm is too simple (e.g., a hypothesis with a linear equation), it may fall into a high-bias, low-variance condition, making it error-prone.
• On the other hand, if the algorithm is too complex (hypothesis with a high-degree
equation), it may result in high variance and low bias. In this latter case, the model will
not perform well on new, unseen data.
• There is a balance between these two extremes, known as the Trade-off or Bias-Variance
Trade-off.
• This trade-off arises because an algorithm cannot be both highly complex and overly
simple at the same time.
• In terms of a graph, the perfect trade-off appears as a balance point between bias and
variance, where the model achieves the best performance.
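To see the trade-off numerically, here is a minimal sketch assuming scikit-learn: a degree-1 polynomial (too simple, high bias), a moderate degree, and a degree-15 polynomial (too complex, high variance) are compared by cross-validated error on made-up noisy data:

# Comparing model complexity by cross-validated MSE; data is synthetic.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.uniform(0, 1, 30).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 30)

for degree in (1, 4, 15):  # too simple, balanced, too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    score = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    print(f"degree={degree:2d}  CV MSE={-score.mean():.3f}")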
Q6. Explain Overfitting and Underfitting in detail.
Ans:
Underfitting
• A statistical model or machine learning algorithm is said to underfit when it is too simple to capture the complexities of the data.
• This represents the model's inability to learn from the training data effectively, leading to
poor performance on both training and testing data.
• In simple terms, an underfit model is inaccurate, especially when applied to new, unseen
examples.
• This usually occurs when using a very simple model with overly simplified assumptions.
• To address underfitting, one should use more complex models, enhance feature
representation, and reduce regularization constraints.
Reasons for Underfitting:
• The model is too simple and cannot represent the complexities of the data.
• Input features used for training are not adequate representations of the underlying factors
affecting the target variable.
• The training dataset is too small.
• Excessive regularization restricts the model from capturing the data effectively.
• Features are not properly scaled.
Overfitting
• A model is said to overfit when it is too complex and learns the noise and random fluctuations in the training data rather than only the underlying pattern.
• An overfit model performs well on training data but poorly on new, unseen data.
• To address overfitting, one can simplify the model, apply regularization, gather more training data, or use cross-validation.
Simple Linear Regression equation: y = b0 + b1·x
here,
• x = input value
• y = predicted output
• b0 = bias or intercept term
• b1 = coefficient for input (x)
4. Stepwise Regression
Stepwise regression is a technique used when dealing with multiple independent variables. It
automatically selects the most significant variables based on statistical criteria, with no human
intervention.
The process relies on evaluating statistical metrics such as:
• R-squared: Measures the proportion of variance explained by the model.
• t-statistics: Tests the significance of individual predictors.
• AIC (Akaike Information Criterion): Assesses the model's quality while penalizing
complexity.
How It Works:
Stepwise regression fits the model by adding or removing variables one at a time based on
predefined criteria, refining the model iteratively.
Common Stepwise Methods:
1. Standard Stepwise Regression: Adds and removes predictors at each step based on
their significance.
2. Forward Selection: Starts with no variables and adds the most significant predictor at
each step.
3. Backward Elimination: Starts with all variables and removes the least significant one
at each step.
Objective: The goal is to maximize prediction accuracy while using the fewest possible
predictors, making this method particularly useful for handling high-dimensional datasets.
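As an illustration, here is a minimal forward-selection sketch using scikit-learn's SequentialFeatureSelector as a stand-in (scikit-learn does not implement classic p-value/AIC-based stepwise regression directly); the Diabetes dataset is used purely for illustration:

# Forward selection of the 3 most useful features for a linear model.
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)

selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=3,
    direction="forward")  # use direction="backward" for backward elimination
selector.fit(X, y)
print("Selected feature indices:", selector.get_support(indices=True))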
Q7. Explain Cross Validation and its techniques.
Ans:
Cross-Validation (CV)
Cross-validation is a resampling technique that evaluates a model by repeatedly training and testing it on different subsets of the data. It helps ensure that the model generalizes well on unseen data and reduces problems like overfitting.
Techniques:
1. Hold-Out Validation (Train/Test Split)
How It Works:
• Randomly divide the dataset into two parts (commonly 70% for training and 30% for
testing).
• Train the model using the training set.
• Test the model on the testing set to evaluate its performance.
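Another widely used technique is k-fold cross-validation (also mentioned under Model Evaluation below), which rotates the test fold across the dataset. A minimal sketch, assuming scikit-learn:

# 5-fold CV: train on 4 folds, test on the 5th, rotating 5 times.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())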
4. Hyperparameter Tuning
Hyperparameter tuning is the process of adjusting parameters that are set manually before
training a machine learning model. These parameters, unlike model parameters, are not
learned from the data but are defined by the programmer and can significantly impact model
performance.
Examples of Hyperparameters:
• Learning rate in gradient descent
• Number of neighbors (k) in K-NN
• Regularization strength (C) in Logistic Regression or SVM
• Maximum depth of a decision tree
The goal of hyperparameter tuning is to find the best combination of hyperparameters that
maximizes model performance. This is typically treated as a search problem.
Hyperparameter Tuning Strategies
a) Grid Search CV
• Defines a grid of candidate values for each hyperparameter and exhaustively tries every combination.
• Evaluates model performance for each combination and selects the best one.
Example:
If tuning hyperparameters C and Alpha for Logistic Regression:
The model evaluates all combinations. If the best performance score (e.g., 0.726) comes from C
= 0.3 and Alpha = 0.2, that combination is selected.
Drawback:
• Computationally expensive, since the number of combinations grows rapidly with the number of hyperparameters and candidate values.
b) Randomized Search CV
• Samples a fixed number of random hyperparameter combinations instead of trying all of them.
Advantage:
• Much faster than Grid Search on large search spaces, while often finding a comparably good combination.
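A minimal sketch of both strategies, assuming scikit-learn. Only C is tuned here, since the "Alpha" in the example above has no direct LogisticRegression parameter in scikit-learn:

# Grid search tries every candidate; randomized search samples a few.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 0.2, 0.3, 1.0, 10.0]}

grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
grid.fit(X, y)  # evaluates every value of C
print("Grid best:", grid.best_params_, grid.best_score_)

rand = RandomizedSearchCV(LogisticRegression(max_iter=1000), param_grid,
                          n_iter=3, cv=5, random_state=0)
rand.fit(X, y)  # samples only 3 of the 5 candidates
print("Random best:", rand.best_params_, rand.best_score_)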
Q8. Explain Gradient Descent and its types.
Ans:
Gradient descent is an optimization algorithm that minimizes a model's cost function by iteratively adjusting the model's parameters. Working:
1. Initial Parameters: It starts with an arbitrary set of model parameters (weights and biases) to evaluate performance.
2. Derivative (Slope): From the starting point, the algorithm calculates the derivative
(slope) of the cost function.
3. Tangent Line: The slope is used to create a tangent line that shows the steepness of the
curve at that point.
4. Updating Parameters: The algorithm adjusts the parameters (weights and biases) in
the direction of the negative gradient, aiming for the local minimum or global
minimum of the cost function.
5. Learning Rate: The learning rate (denoted as η) controls the size of the steps taken
toward the minimum. A higher learning rate results in larger steps, which can lead to
overshooting the minimum, while a smaller learning rate offers more precision but may
require more iterations to converge.
The process continues until the cost function reaches its minimum (or close to zero), at which
point the model stops adjusting its parameters.
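A minimal plain-NumPy sketch of these steps, fitting y = w·x + b by minimizing mean squared error; the data and learning rate are made up:

# Gradient descent on made-up data with true relationship y = 2x + 1.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

w, b = 0.0, 0.0                            # step 1: arbitrary initial parameters
eta = 0.05                                 # step 5: learning rate

for _ in range(2000):
    y_pred = w * x + b
    # steps 2-3: derivative (slope) of the MSE cost w.r.t. w and b
    dw = (2 / len(x)) * np.sum((y_pred - y) * x)
    db = (2 / len(x)) * np.sum(y_pred - y)
    # step 4: move against the gradient toward the minimum
    w -= eta * dw
    b -= eta * db

print(f"w={w:.3f}, b={b:.3f}")             # approaches w=2, b=1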
Types of Gradient Descent
There are three primary variations of gradient descent, each with its unique advantages and
trade-offs:
1. Batch Gradient Descent:
Description: Computes the gradient of the cost function using the entire training
dataset. The parameters are updated after the complete dataset is processed.
Pros: Stable convergence and a smooth gradient.
Cons: Can be computationally expensive and slow, especially for large datasets,
since it requires storing the entire dataset in memory.
2. Stochastic Gradient Descent (SGD):
Description: Updates the parameters after processing each individual training
sample. This results in more frequent updates.
Pros: Faster since it doesn’t require processing the entire dataset at once, making
it memory-efficient.
Cons: The updates can be noisy, leading to fluctuations in the convergence path.
However, this can help escape local minima and find the global minimum.
3. Mini-Batch Gradient Descent:
Description: Combines aspects of both batch and stochastic gradient descent by
splitting the dataset into small batches and updating parameters after each batch.
Pros: Offers a balance between computational efficiency and the speed of
convergence. It's faster than batch gradient descent while being more stable than
SGD.
Q9. Explain KNN
Ans:
1. K-Nearest Neighbors (K-NN) is a simple supervised learning algorithm mainly used for
classification but can also handle regression tasks.
2. Similarity-Based Classification: It classifies new data points based on their similarity
to stored data, assigning them to the most similar category.
3. Non-Parametric Nature: K-NN makes no assumptions about the underlying data
distribution.
4. Lazy Learner: It doesn’t learn during training but stores the dataset and classifies new
data when required.
5. On-the-Fly Classification: When new data appears, K-NN compares it to stored cases
and assigns it to the closest matching category.
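A minimal K-NN sketch, assuming scikit-learn; k = 3 is an arbitrary choice:

# K-NN classification: fit() simply stores the data (lazy learner).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=3)  # classify by the 3 closest neighbors
knn.fit(X_train, y_train)
print("Test accuracy:", knn.score(X_test, y_test))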
Advantages of KNN:
1. Simple: Easy to understand and implement.
2. Versatile: Works with numerical and categorical data.
3. Non-parametric: No assumptions about data distribution.
Disadvantages of KNN:
1. Scalability: Slow with large datasets.
2. Curse of Dimensionality: Performance drops with more features.
3. Sensitive to Imbalance: Favors majority class in imbalanced data.
4. Feature Scaling Required: Affected by varying feature scales.
Q10. Explain SVM.
Ans: Support Vector Machines (SVMs) are supervised machine learning algorithms primarily
used for classification and regression tasks.
They work by identifying the optimal hyperplane that best separates data points of different
classes in a high-dimensional space.
Key Concepts:
1. Hyperplane: A decision boundary that separates data points of different classes. In two
dimensions, this is a line; in three dimensions, a plane; and in higher dimensions, a
hyperplane.
2. Support Vectors: Data points that are closest to the hyperplane and influence its
position and orientation. These points are critical in defining the optimal hyperplane.
3. Margin: The distance between the hyperplane and the nearest support vectors from
either class. SVM aims to maximize this margin to enhance the classifier's generalization
ability.
Types of SVM:
• Linear SVM: Used when data is linearly separable, meaning a straight line or
hyperplane can effectively separate the classes.
• Non-Linear SVM: Employed when data is not linearly separable. SVM uses kernel
functions to map data into higher-dimensional spaces where a linear separation is
possible.
Kernel Functions:
Kernel functions enable SVM to perform non-linear classification by implicitly mapping
input data into higher-dimensional spaces. Common kernels include:
• Linear Kernel: Suitable for linearly separable data.
• Polynomial Kernel: Captures interactions between features.
• Radial Basis Function (RBF) Kernel: Effective for cases where the relationship
between class labels and attributes is non-linear.
Advantages of SVM:
• Effective in High-Dimensional Spaces: Performs well even when the number of features is large.
• Memory Efficiency: Uses only a subset of the training points (the support vectors) in the decision function.
• Versatility: Can handle both linear and non-linear classification via kernel functions.
Disadvantages of SVM:
• Computational Complexity: Training can be slow with large datasets.
• Parameter Selection: Choosing the right kernel and tuning parameters can be difficult.
• Interpretability: The model can be complex and harder to interpret compared to
simpler algorithms.
Applications of SVM:
Text Classification, Image Recognition, Bioinformatics.
Example: Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose we
have a dataset that has two tags (green and blue), and the dataset has two features x1
and x2. We want a classifier that can classify the pair (x1, x2) of coordinates as either green or blue.
Since this is a 2-D space, a straight line can easily separate these two classes. However, multiple such lines exist; SVM selects the one that maximizes the margin to the nearest points of each class.
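A minimal linear-SVM sketch for this two-tag example, assuming scikit-learn; the (x1, x2) points are made up:

# Linear SVM separating two made-up 2-D classes.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 3], [2, 1],      # "blue" class
              [6, 5], [7, 7], [8, 6]])     # "green" class
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear")                 # use kernel="rbf" for non-linear data
clf.fit(X, y)
print("Support vectors:", clf.support_vectors_)
print("Prediction for (3, 3):", clf.predict([[3, 3]])[0])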
Q11. Explain Ensemble Learning.
Ans:
• Ensemble learning is a powerful machine learning technique where multiple models,
known as hypotheses, are combined to make predictions.
• The key idea behind ensemble learning is that by aggregating the predictions of several
models, we can improve overall accuracy and robustness compared to using a single model.
• This method is especially useful for reducing the risk of misclassification.
1. Bagging (Bootstrap Aggregating)
The key idea is to train the same algorithm multiple times using different subsets of the training data.
How it Works?
1. Data Sampling: Multiple subsets of the original dataset are created using bootstrap
sampling (i.e., sampling with replacement).
2. Model Training: Each subset is used to train a separate model. Importantly, all these
models are of the same type.
3. Parallel Learning: The models are learned independently and in parallel.
4. Aggregation: The final output prediction is averaged (for regression) or voted (for
classification) across all models.
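A minimal bagging sketch, assuming scikit-learn: many decision trees, each trained on a bootstrap sample, combined by voting:

# Bagging: same algorithm, bootstrap-sampled subsets, majority vote.
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

bagging = BaggingClassifier(
    DecisionTreeClassifier(),   # the same model type for every member
    n_estimators=50,            # 50 bootstrap-sampled subsets
    bootstrap=True,             # sampling with replacement
    random_state=0)
print("Mean CV accuracy:", cross_val_score(bagging, X, y, cv=5).mean())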
2. Boosting
Unlike Bagging, Boosting focuses on reducing bias and builds models sequentially, where each model attempts to correct the errors of the previous one.
Example: AdaBoost
How it Works?
1. Train a weak learner on the data and identify the samples it misclassifies.
2. Increase the weights of the misclassified samples so that the next learner focuses on them.
3. Repeat the process and combine all learners into a final prediction through a weighted vote.
Q12. Explain Artificial Neural Networks (ANN).
Ans:
An Artificial Neural Network (ANN) is a computing system inspired by the biological neural networks of the brain, consisting of interconnected nodes (neurons) organized in layers.
Components of an ANN:
• Input Layer: Receives the input signals and passes them on to the next layer.
• Hidden Layer(s): Intermediate layer(s) that perform computations and transfer
information from the input nodes to the output nodes.
• Output Layer: Delivers the final output of the neural network.
Working of an ANN:
1. Input Processing: Inputs are received by the input layer, each input associated with a
weight signifying its importance.
2. Weighted Sum: These inputs are multiplied by their respective weights and then
summed.
3. Adding Bias: A bias (akin to an intercept in linear models) is usually added to the
weighted sum to help the model fit the data better.
4. Activation Function: The result is passed through an activation function, which
determines the neuron's output. Common activation functions include:
o Sigmoid , Hyperbolic Tangent (Tanh), Rectified Linear Unit (ReLU)
5. Output Generation: The process continues through the network until the output layer
produces the final result.
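A minimal plain-NumPy sketch of one forward pass through a tiny network (2 inputs → 3 hidden neurons → 1 output); all weights and inputs are made-up values:

# One forward pass: weighted sum + bias + activation, layer by layer.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))        # squashes output into (0, 1)

x = np.array([0.5, 0.8])                    # step 1: inputs
W1 = np.array([[0.1, 0.4], [0.2, 0.3], [0.7, 0.6]])  # hidden-layer weights
b1 = np.array([0.1, 0.1, 0.1])              # step 3: bias terms
W2 = np.array([0.3, 0.5, 0.9])              # output-layer weights
b2 = 0.2

hidden = sigmoid(W1 @ x + b1)               # steps 2-4: sum, bias, activation
output = sigmoid(W2 @ hidden + b2)          # step 5: final output
print("Network output:", output)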
Types of ANN:
1. Feedforward Neural Networks (FNNs): The simplest type of ANN, where the data
moves in one direction from input to output nodes.
2. Recurrent Neural Networks (RNNs): Designed for processing sequential data, the
outputs from neurons can loop back into the network, creating a 'memory' of previous
inputs.
3. Convolutional Neural Networks (CNNs): Primarily used in image recognition and
processing, they are structured to pick up on spatial hierarchies in data.
Applications of ANN:
• Image & Voice Recognition: Used in image classification, facial, and voice recognition.
• NLP: Applied in translation, sentiment analysis, and text generation.
• Predictive Analytics: Used in finance (stock prediction) and healthcare (disease
diagnosis).
Advantages of ANN:
• Learns Non-linear Relationships: Handles complex, non-linear data.
• Generalization: Predicts unseen data after training.
• Parallel Processing: Efficient in multitasking.
Disadvantages of ANN:
• Black Box Nature: Lacks transparency in decision-making.
• Hardware Dependent: Needs parallel processing power.
• Data & Computation Intensive: Requires large datasets and high computing power.
Q13. Explain Decision Trees.
Ans:
• Decision Tree is a supervised learning technique that can be used for both classification
and regression problems, though it is mostly preferred for solving classification problems.
• It is a tree-structured classifier where internal nodes represent the features of a dataset,
branches represent decision rules, and each leaf node represents the outcome.
In a Decision Tree, there are two types of nodes:
• Decision Nodes: Used to make decisions and have multiple branches.
• Leaf Nodes: Represent the final outcome and do not contain any further branches.
• The decisions or tests are performed based on the features of the given dataset.
• A Decision Tree is a graphical representation that outlines all possible solutions to a
problem based on given conditions.
• It starts with a root node, which expands into further branches, forming a tree-like
structure.
• To build a tree, we use the CART (Classification and Regression Tree) algorithm.
• A Decision Tree asks a question, and based on the answer (Yes/No), it further splits the
tree into subtrees.
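A minimal decision-tree sketch, assuming scikit-learn (whose trees are built with the CART algorithm mentioned above):

# A shallow decision tree; export_text prints the learned yes/no questions.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["sepal len", "sepal wid",
                                       "petal len", "petal wid"]))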
Q14. Explain Random Forest.
Ans:
• Random Forest is an ensemble learning method that combines multiple decision trees to improve accuracy and reduce overfitting.
Example Use-Case:
A fruit image dataset is divided into subsets, each used to train a different decision tree. For a new image, the model predicts based on the majority decision of all trees.
Applications:
• Banking: Loan risk assessment.
• Medicine: Disease detection and risk analysis.
• Land Use: Identifying areas of similar usage.
• Marketing: Analyzing trends and consumer behavior.
Advantages:
✔ Supports both classification & regression.
✔ Handles large datasets & high-dimensional data efficiently.
✔ Reduces overfitting, improving generalization.
Disadvantages:
✖ Less effective for regression as averaging may lose details in continuous data.
Q15. Explain the concept of Model Evaluation and Model Selection.
Ans:
Model Selection and Evaluation
• Model selection is the process of choosing the best algorithm based on performance
metrics to solve a specific problem.
• It involves comparing models to find the one with the highest accuracy and predictive
power.
• A good model balances fit and generalization—avoiding underfitting (too simple, poor
predictions) and overfitting (too complex, poor generalization).
Model Evaluation
1. Performance Metrics
The choice of evaluation metric depends on the type of problem:
• Classification: Accuracy, Precision, Recall, F1-Score, ROC-AUC.
• Regression: Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean
Absolute Error (MAE), R² (Coefficient of Determination).
2. Additional Evaluation Techniques
• Confusion Matrix: Helps analyze false positives and false negatives in classification
problems.
• ROC Curve & AUC: Evaluates classification performance across different threshold
settings.
• Error Analysis: Identifies patterns in model errors to improve predictions.
• Cross-Validation: Ensures model consistency using techniques like k-fold and stratified
k-fold validation.
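A minimal sketch of computing classification metrics, assuming scikit-learn; the true and predicted labels are made up:

# Accuracy, precision, recall, F1, and confusion matrix on made-up labels.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:", recall_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))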
Model Selection
Key Steps in Selecting the Best Model
1. Experiment with Multiple Models: Compare different models (linear, ensemble,
neural networks) to determine the best fit.
2. Feature Importance & Selection: Identify and remove less relevant features to
improve efficiency.
3. Hyperparameter Tuning: Optimize model parameters using Grid Search, Random
Search, or Bayesian Optimization.
4. Model Validation: Validate performance on a separate dataset to check for overfitting.
5. Learning Curves Analysis: Identify underfitting (requires more data) or overfitting
(excessive complexity).
6. Cost-Benefit Analysis: Consider trade-offs between performance, computational cost,
and real-world constraints.
Considerations in Model Selection
• Bias-Variance Tradeoff: Balance between underfitting (high bias) and overfitting (high
variance).
• Interpretability vs. Performance: More complex models may have better accuracy but
lower interpretability.
• Computational Efficiency: Consider processing speed and resource usage for real-time
applications.
• Robustness and Generalizability: The model should perform consistently across
various sets of data and under different conditions.