AIML

UNIT :- 5

Unsupervised Learning

Unsupervised learning is a type of machine learning where the model is trained on unlabeled
data. Unlike supervised learning, where the model learns from labeled data, unsupervised
learning finds patterns, relationships, or structures in data without explicit guidance.

Problems in Unsupervised Learning

1. No Clear Accuracy Measure – Unlike supervised learning, there is no straightforward way to evaluate the model's accuracy.

2. Difficult to Interpret – Since there are no predefined labels, understanding and interpreting the results can be challenging.

3. Scalability Issues – Many unsupervised learning algorithms, such as clustering, struggle with large datasets due to computational complexity.

4. Overfitting – Without labeled data, models may overfit to noise rather than learning meaningful patterns.

K-Means Clustering

K-Means is an unsupervised clustering algorithm used to group data into K clusters. It works
by:

1.​ Selecting K random centroids.​

2.​ Assigning each data point to the nearest centroid.​

3.​ Recalculating centroids based on the assigned data points.​

4.​ Repeating steps 2-3 until centroids stop changing or a maximum number of iterations is
reached.​
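As a quick illustration of these steps, here is a minimal sketch using scikit-learn's KMeans; the synthetic blobs and the choice K = 3 are assumptions made purely for the example:

```python
# Minimal K-Means sketch with scikit-learn (synthetic data, K = 3 chosen for illustration).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Three loose blobs of 2-D points to cluster.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in [(0, 0), (5, 5), (0, 5)]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)  # steps 1-4 run inside fit()
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)  # final centroids after convergence
print(labels[:10])              # cluster assignment of the first 10 points
```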

Use Cases:

●​ Customer segmentation​
●​ Anomaly detection​

●​ Image compression​

Challenges:

●​ Choosing the right K value is difficult.​

●​ Sensitive to outliers.​

●​ May converge to local minima.​

Principal Component Analysis (PCA)

PCA is a dimensionality reduction technique used in machine learning to transform a dataset into a lower-dimensional space while preserving as much variance as possible.

Steps of PCA:

1.​ Standardize the dataset.​

2.​ Compute the covariance matrix.​

3.​ Find the eigenvalues and eigenvectors.​

4.​ Select the top principal components.​

5.​ Transform the data into the new feature space.​
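A minimal sketch of these steps with scikit-learn, assuming a small random dataset with two correlated features (scikit-learn's PCA performs steps 2–5 internally):

```python
# Minimal PCA sketch using scikit-learn (random correlated data is an assumption for the demo).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 1] = X[:, 0] * 2 + rng.normal(scale=0.1, size=100)  # make two features correlated

X_std = StandardScaler().fit_transform(X)   # step 1: standardize
pca = PCA(n_components=2)                   # steps 2-4: covariance, eigen-decomposition, top components
X_reduced = pca.fit_transform(X_std)        # step 5: project into the new feature space

print(pca.explained_variance_ratio_)  # fraction of variance kept by each component
print(X_reduced.shape)                # (100, 2)
```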

Applications:

●​ Image compression​

●​ Noise reduction​

●​ Feature extraction​

Advantages:
●​ Reduces computational cost.​

●​ Removes correlation among features.​

Disadvantages:

●​ Loss of interpretability.​

●​ Can discard useful information.​

Different Libraries of Python for Machine Learning

Python offers many libraries for machine learning, including:

1.​ NumPy – Provides support for large multidimensional arrays and numerical
computations.​

2.​ Pandas – Used for data manipulation and analysis.​

3.​ Matplotlib & Seaborn – Used for data visualization.​

4.​ Scikit-learn – Provides simple and efficient tools for machine learning, including
classification, regression, and clustering.​

5.​ TensorFlow & PyTorch – Used for deep learning and neural networks.​

6.​ Keras – A high-level API for deep learning, built on TensorFlow.​

7.​ Statsmodels – Used for statistical modeling and hypothesis testing.​
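A brief sketch of how a few of these libraries are commonly combined; the import aliases are conventions rather than requirements, and the tiny DataFrame is made up for the example:

```python
# Conventional import aliases for the libraries listed above.
import numpy as np                      # numerical arrays
import pandas as pd                     # tabular data manipulation
import matplotlib.pyplot as plt         # plotting
import seaborn as sns                   # statistical visualization
from sklearn.linear_model import LinearRegression  # one of scikit-learn's many estimators

df = pd.DataFrame({"x": np.arange(10), "y": np.arange(10) * 2.0})
model = LinearRegression().fit(df[["x"]], df["y"])
print(model.coef_, model.intercept_)  # approximately [2.0] and 0.0
```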



UNIT :- 4

Supervised Learning

Supervised learning is a type of machine learning where a model is trained on labeled data. The
algorithm learns from input-output pairs and makes predictions on new data.
Problems in Supervised Learning

1.​ Requires Labeled Data – Labeling data is expensive and time-consuming.​

2.​ Overfitting – The model may learn noise instead of actual patterns.​

3.​ Computational Cost – Large datasets require high processing power.​

4.​ Bias in Data – If the training data is biased, the model may make incorrect predictions.​

5.​ Limited Generalization – The model may not perform well on unseen data.​

Classification vs. Regression


Feature Classification Regression

Definition Predicts discrete categories (e.g., Predicts continuous values (e.g.,


spam or not spam). house price).

Output Type Categorical (labels). Continuous (numerical values).

Example Logistic Regression, Decision Trees, Linear Regression, Polynomial


Algorithms SVM. Regression.

Use Cases Fraud detection, sentiment analysis. Stock price prediction,


temperature forecasting.

Linear Regression

Linear Regression is a regression algorithm that models the relationship between independent
(X) and dependent (Y) variables using a straight line:

Y = mX + b

where:

●​ m = slope (coefficient)

●​ b = intercept
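A minimal sketch of fitting this line by least squares with NumPy; the five data points are made up for illustration:

```python
# Fitting Y = mX + b with NumPy's least-squares polyfit (illustrative data).
import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([2.1, 4.2, 5.9, 8.1, 9.8])  # roughly Y = 2X

m, b = np.polyfit(X, Y, deg=1)  # degree-1 polynomial = straight line
print(f"slope m = {m:.2f}, intercept b = {b:.2f}")
print("prediction at X=6:", m * 6 + b)
```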

Applications:
●​ House price prediction​

●​ Sales forecasting​

Limitations:

●​ Assumes a linear relationship.​

●​ Sensitive to outliers.​

Logistic Regression

Logistic Regression is a classification algorithm used to predict categorical outcomes. Instead of a straight line, it uses the sigmoid function:

P(Y) = \frac{1}{1 + e^{-(mX + b)}}
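A short sketch of the sigmoid itself, showing why logistic regression outputs can be read as probabilities; the coefficients m and b here are assumed example values:

```python
# The sigmoid squashes any real number into (0, 1), so outputs behave like probabilities.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

m, b = 1.5, -2.0                 # assumed example coefficients
for x in [-2.0, 0.0, 1.5, 4.0]:
    print(x, round(sigmoid(m * x + b), 3))  # P(Y=1) rises smoothly with x
```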

Applications:

●​ Spam detection​

●​ Medical diagnosis​

Advantages:

●​ Simple and effective for binary classification.​

●​ Outputs probabilities.​

Disadvantages:

●​ Doesn't work well for non-linear relationships.​

●​ Sensitive to outliers.​

Polynomial Regression
Polynomial Regression is an extension of Linear Regression where the relationship between
variables is non-linear. It fits a polynomial equation:

Y = a_0 + a_1X + a_2X^2 + a_3X^3 + ... + a_nX^n
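A minimal sketch of a quadratic (degree-2) fit with NumPy; the synthetic curved data and the chosen degree are assumptions for the example:

```python
# Quadratic fit with NumPy (synthetic curved data, degree 2 chosen for illustration).
import numpy as np

X = np.linspace(0, 5, 20)
Y = 1.0 + 0.5 * X + 2.0 * X**2 + np.random.default_rng(1).normal(scale=0.5, size=20)

coeffs = np.polyfit(X, Y, deg=2)     # returns [a2, a1, a0], highest degree first
print(coeffs)                        # approximately [2.0, 0.5, 1.0]
print(np.polyval(coeffs, 2.5))       # predict Y at X = 2.5
```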

Applications:

●​ Weather prediction​

●​ Stock market analysis​

Advantages:

●​ Captures non-linear relationships.​

Disadvantages:

●​ Overfitting with high-degree polynomials.​

Decision Tree

A Decision Tree is a tree-like model used for both classification and regression. It splits data
based on feature conditions.

How It Works:

1.​ Select the best feature to split data (using Gini impurity or entropy).​

2.​ Split the dataset into subsets.​

3.​ Repeat until reaching a stopping condition (e.g., max depth).​
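A minimal sketch of these steps with scikit-learn; the Iris dataset and max_depth=3 are assumptions used purely as a stand-in:

```python
# Decision-tree sketch with scikit-learn (Iris dataset used purely as a stand-in).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)  # Gini splits, depth as stopping condition
tree.fit(X, y)

print(export_text(tree))          # the learned feature-threshold splits, as text
print(tree.predict(X[:3]))        # class predictions for the first three samples
```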

Advantages:

●​ Easy to understand.​

●​ Handles both numerical and categorical data.​

Disadvantages:
●​ Prone to overfitting.​

●​ Unstable with small data changes.​

Random Forest

Random Forest is an ensemble learning method that uses multiple decision trees to improve
accuracy.

How It Works:

1.​ Create multiple decision trees using random subsets of data.​

2.​ Combine the outputs using majority voting (for classification) or averaging (for
regression).​
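A minimal sketch of this ensemble with scikit-learn; the Iris dataset, the 70/30 split, and 100 trees are assumptions for the example:

```python
# Random-forest sketch: many trees on random subsets, predictions aggregated by voting.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)  # 100 trees, each on a bootstrap sample
forest.fit(X_tr, y_tr)
print("test accuracy:", forest.score(X_te, y_te))  # aggregated (voted) predictions
```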

Advantages:

●​ Reduces overfitting.​

●​ Handles missing values well.​

Disadvantages:

●​ Computationally expensive.​

●​ Hard to interpret compared to a single tree.​

Naïve Bayes

Naïve Bayes is a probabilistic classifier based on Bayes' Theorem:

P(A|B) = \frac{P(B|A) \, P(A)}{P(B)}

It assumes that features are independent, which simplifies calculations.
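A minimal sketch for the spam-filtering use case below, assuming a tiny made-up corpus; word counts are the features, and each word is treated as independent given the class:

```python
# Naive Bayes sketch for text classification (tiny made-up spam corpus).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["win money now", "meeting at noon", "cheap money offer", "lunch with team"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

X = CountVectorizer().fit_transform(texts)   # word counts; features assumed independent
clf = MultinomialNB().fit(X, labels)
print(clf.predict_proba(X[:1]))              # [P(not spam), P(spam)] for the first message
```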

Applications:
●​ Spam filtering​

●​ Sentiment analysis​

Advantages:

●​ Fast and efficient.​

●​ Works well with small datasets.​

Disadvantages:

●​ Assumes feature independence (which may not always be true).​

Support Vector Machine (SVM)

SVM is a classification algorithm that finds the optimal hyperplane to separate data points.

Key Concepts:

●​ Margin: The distance between the hyperplane and the closest points.​

●​ Kernel Trick: Allows SVM to handle non-linear data.​
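To make the kernel trick concrete, here is a minimal sketch on data that is not linearly separable; the synthetic circles dataset and the RBF kernel are assumptions for the example:

```python
# SVM sketch with an RBF kernel on non-linear data (synthetic circles dataset).
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)  # not linearly separable

svm = SVC(kernel="rbf", C=1.0)   # kernel trick: implicit mapping to a higher-dimensional space
svm.fit(X, y)
print("training accuracy:", svm.score(X, y))
print("support vectors:", svm.support_vectors_.shape[0])  # points that define the margin
```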

Applications:

●​ Image classification​

●​ Text categorization​

Advantages:

●​ Works well with high-dimensional data.​

●​ Robust against overfitting.​

Disadvantages:
●​ Computationally expensive for large datasets.​

●​ Difficult to tune hyperparameters.​


UNIT :- 3

Concept of Probability and Its Types

Probability measures the likelihood of an event occurring, represented as a number between 0 and 1.

P(A) = \frac{\text{Favorable Outcomes}}{\text{Total Outcomes}}

Types of Probability:

1.​ Classical Probability – Assumes all outcomes are equally likely. (e.g., rolling a fair die)​

2.​ Empirical Probability – Based on observations from experiments.​

3.​ Subjective Probability – Based on personal judgment or intuition.​

4.​ Conditional Probability – The probability of event A occurring given that B has already
happened.​

Descriptive vs. Inferential Statistics


| Feature | Descriptive Statistics | Inferential Statistics |
| --- | --- | --- |
| Definition | Summarizes and organizes data. | Draws conclusions from data. |
| Techniques | Mean, median, mode, standard deviation. | Hypothesis testing, confidence intervals. |
| Purpose | Describes a dataset. | Makes predictions about a population. |
| Example | Average height of students in a class. | Predicting election results from a sample. |

Types of Inferential Statistics

1.​ Estimation​

○​ Point Estimation – A single value estimate (e.g., sample mean).​

○​ Interval Estimation – A range of values (e.g., confidence intervals).​

2.​ Hypothesis Testing​

○​ Null Hypothesis (H_0) – No effect or relationship exists.​

○​ Alternative Hypothesis (H_1) – A significant effect exists.​

○​ Uses t-tests, chi-square tests, ANOVA, etc.​

3.​ Regression Analysis – Determines relationships between variables.​

4.​ ANOVA (Analysis of Variance) – Compares multiple group means.​

5.​ Chi-Square Test – Tests relationships between categorical variables.​

Random Variables and Its Types

A random variable represents numerical outcomes of a random experiment.

Types:

1.​ Discrete Random Variable – Takes countable values (e.g., number of heads in coin
flips).​

2.​ Continuous Random Variable – Takes infinite values within a range (e.g.,
temperature).​
Central Limit Theorem (CLT) and Its Rules

The Central Limit Theorem (CLT) states that the distribution of the sample mean approaches a
normal distribution as the sample size increases, regardless of the population distribution.

Rules of CLT:

1.​ The sample size should be sufficiently large (n ≥ 30).​

2.​ The population can be of any distribution, but the sample mean will be approximately
normal.​

3.​ The mean of the sample distribution equals the population mean (μ).​

4.​ The standard deviation of the sample mean is given by:

\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}
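A small simulation sketch of the theorem, assuming an exponential (strongly skewed) population and samples of size n = 30:

```python
# CLT sketch: means of samples from a skewed population still behave like a normal distribution.
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=100_000)   # skewed, definitely not normal

sample_means = [rng.choice(population, size=30).mean() for _ in range(5_000)]  # n = 30

print("population mean mu:", population.mean().round(3))
print("mean of sample means:", np.mean(sample_means).round(3))    # close to mu
print("std of sample means:", np.std(sample_means).round(3))      # close to sigma / sqrt(30)
print("sigma / sqrt(n):", (population.std() / np.sqrt(30)).round(3))
```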

Sampling Distribution and Its Types

A sampling distribution is the probability distribution of a statistic based on repeated samples from a population.

Types:

1.​ Sampling Distribution of the Mean – Distribution of sample means.​

2.​ Sampling Distribution of the Proportion – Distribution of sample proportions.​

3.​ t-Distribution – Used when the sample size is small.​

4.​ Chi-Square Distribution – Used for variance estimation.​

Cross-Validation and Its Types

Cross-validation is a technique used to evaluate machine learning models by splitting data into
training and testing sets multiple times.

Types:

1.​ K-Fold Cross-Validation – Splits data into K subsets and trains the model K times.​
2.​ Leave-One-Out Cross-Validation (LOO-CV) – Each observation is used as a test set
while the rest form the training set.​

3.​ Stratified K-Fold – Ensures each fold has the same class proportion.​

4.​ Time-Series Cross-Validation – Used for time-dependent data, preserving chronological order.​
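A minimal sketch of stratified K-fold cross-validation with scikit-learn; the 5 folds, logistic regression model, and Iris dataset are assumptions for the example:

```python
# K-fold cross-validation sketch with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # each fold keeps the class proportions

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores)               # one accuracy per fold
print(scores.mean())        # averaged estimate of generalization performance
```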

Bayes’ Theorem and Its Importance

Bayes’ Theorem describes the probability of an event based on prior knowledge of related
conditions.

P(A|B) = \frac{P(B|A) \, P(A)}{P(B)}

Importance:

1.​ Used in spam filtering (probability of spam given specific words).​

2.​ Applied in medical diagnosis (probability of a disease given symptoms).​

3.​ Essential for machine learning models in probabilistic reasoning.​

4.​ Forms the foundation of Naïve Bayes classifiers.​



UNIT :- 2

Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is the process of analyzing datasets to summarize their main
characteristics, often using visualizations and statistical techniques.

Steps in EDA:

1.​ Understanding Data – Checking data types, missing values, and distributions.​
2.​ Summary Statistics – Computing measures like mean, median, and standard deviation.​

3.​ Visualizations – Using histograms, scatter plots, box plots, etc.​

4.​ Handling Outliers – Identifying and managing extreme values.​

5.​ Correlation Analysis – Checking relationships between variables.​

Importance:

●​ Helps detect patterns and trends.​

●​ Identifies missing values and outliers.​

●​ Guides feature selection for machine learning.​

Descriptive Statistics

Descriptive statistics summarize and organize data without drawing conclusions.

Types of Descriptive Statistics:

1.​ Measures of Central Tendency – Mean, Median, Mode.​

2.​ Measures of Dispersion – Range, Variance, Standard Deviation.​

3.​ Measures of Shape – Skewness and Kurtosis.​

Difference Between Data and Histogram


| Feature | Data | Histogram |
| --- | --- | --- |
| Definition | Raw collection of facts and figures. | A graphical representation of data distribution. |
| Representation | Stored in tables, spreadsheets, databases. | Displayed using bars to represent frequency. |
| Example | List of students' ages. | A bar chart showing age distribution. |
| Purpose | Used for processing, analysis, and storage. | Used to visualize frequency distributions. |

3Ms (Mean, Median, Mode)

The 3Ms are measures of central tendency that describe the "center" of data.

1.​ Mean (Average)

\text{Mean} = \frac{\sum X}{n}
○​ Affected by outliers.​

○​ Used for numerical data with normal distribution.​

2.​ Median (Middle Value)​

○​ The middle value when data is sorted.​

○​ Not affected by outliers.​

3.​ Mode (Most Frequent Value)​

○​ The most frequently occurring value in a dataset.​

○​ Used for categorical data.​

Measure of Dispersion

Measures of dispersion describe how spread out the data is.

Types:

1.​ Range – Difference between the highest and lowest value.

Range = Max − Min

2.​ Variance – Measures how far data points deviate from the mean.

\sigma^2 = \frac{\sum (X - \mu)^2}{n}

3.​ Standard Deviation – Square root of variance, gives spread in original units.

\sigma = \sqrt{\frac{\sum (X - \mu)^2}{n}}

4.​ Interquartile Range (IQR) – Measures spread within the middle 50% of data.

IQR = Q3 − Q1

5.​ Coefficient of Variation (CV) – Compares spread between different datasets.

CV = \frac{\text{Standard Deviation}}{\text{Mean}} \times 100
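A short sketch computing each of these measures with NumPy; the dataset is made up for illustration:

```python
# Computing the dispersion measures above with NumPy (illustrative data).
import numpy as np

data = np.array([10, 12, 14, 15, 18, 22, 25, 30])

print("Range:", data.max() - data.min())
print("Variance:", data.var())            # population variance, divides by n
print("Std dev:", data.std())
q1, q3 = np.percentile(data, [25, 75])
print("IQR:", q3 - q1)
print("CV (%):", data.std() / data.mean() * 100)
```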

5-Number Summary (Box Plot Summary)

A 5-number summary describes key characteristics of a dataset using:

1.​ Minimum – Smallest value in the dataset.​

2.​ First Quartile (Q1) – 25th percentile.​

3.​ Median (Q2) – 50th percentile.​

4.​ Third Quartile (Q3) – 75th percentile.​

5.​ Maximum – Largest value in the dataset.​

Box Plot Components:

●​ Box – Represents IQR (middle 50% of data).​

●​ Whiskers – Extend to minimum and maximum (excluding outliers).​

●​ Outliers – Plotted as individual points beyond whiskers.​
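A minimal sketch producing the 5-number summary and a box plot; the sample data is assumed for the example:

```python
# Five-number summary and a box plot with NumPy/Matplotlib (sample data assumed).
import numpy as np
import matplotlib.pyplot as plt

data = np.array([10, 10, 18, 22, 29, 30, 34, 37, 38, 45, 46, 50, 52, 68, 70, 90, 92])

mn, q1, med, q3, mx = np.percentile(data, [0, 25, 50, 75, 100])
print("min, Q1, median, Q3, max:", mn, q1, med, q3, mx)

plt.boxplot(data)   # box = IQR, whiskers to non-outlier extremes, outliers as points
plt.show()
```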

Importance:

●​ Helps visualize data spread and skewness.​

●​ Identifies outliers easily.​

UNIT :- 1

1. Definition of AI, Applications & Explanation of One


Artificial Intelligence (AI) refers to the simulation of human intelligence in machines that can
learn, reason, problem-solve, and make decisions.

Applications of AI:

1.​ Healthcare – AI diagnoses diseases, predicts patient outcomes, and assists in drug
discovery.​

2.​ Finance – Fraud detection, risk assessment, and algorithmic trading.​

3.​ Self-driving Cars – AI powers autonomous vehicles by recognizing objects and making
driving decisions.​

4.​ Chatbots & Virtual Assistants – Used in customer service (e.g., Siri, Alexa).​

5.​ E-commerce & Recommendation Systems – AI suggests products based on user behavior.​

6.​ Robotics – AI-driven robots automate industrial and household tasks.​

✅ Example Explanation: AI in Healthcare


●​ AI models like IBM Watson analyze medical records to assist doctors.​

●​ AI-based imaging tools detect tumors in MRIs or X-rays.​

●​ AI-powered chatbots provide basic medical guidance to patients.​

2. Problem Characteristics of AI

AI problems have unique characteristics that determine how they are solved.

Key Problem Characteristics:

1.​ Decomposability – Can the problem be broken into smaller subproblems?​

2.​ Ignorability of Steps – Do previous steps matter for the final solution?​

3.​ Solution Type – Is the best solution absolute (fixed) or relative (depends on the
scenario)?​
4.​ State vs. Path Solution – Does solving the problem require a final state or a sequence
of steps?​

5.​ Role of Knowledge – Does solving the problem require domain knowledge?​

✅ Analysis of AI Problems:
| Problem | Decomposable? | Ignore Steps? | Solution Type | State or Path? | Role of Knowledge? |
| --- | --- | --- | --- | --- | --- |
| 8-Puzzle | Yes | No | Relative | Path | Minimal |
| Chess | Yes | No | Relative | Path | High |
| Tower of Hanoi | Yes | No | Absolute | Path | Minimal |

3. Difference Between ANN and BNN


| Feature | Artificial Neural Network (ANN) | Biological Neural Network (BNN) |
| --- | --- | --- |
| Definition | A computational model mimicking human brain neurons. | The real neural network in the human brain. |
| Components | Neurons, weights, activation functions. | Neurons, synapses, axons, and dendrites. |
| Learning Type | Machine learning algorithms. | Learning through experience and neuroplasticity. |
| Processing Speed | Fast, but limited. | Extremely powerful and adaptive. |
| Flexibility | Can be trained for specific tasks. | Can learn new tasks without retraining. |

4. Types of Learning in AI

AI learns from data through different types of learning:

1. Supervised Learning
●​ Uses labeled data.​

●​ Example: Email spam detection.​

2. Unsupervised Learning

●​ Uses unlabeled data to find patterns.​

●​ Example: Customer segmentation.​

3. Reinforcement Learning

●​ Learns through rewards and penalties.​

●​ Example: AlphaGo (game-playing AI).​

4. Semi-Supervised Learning

●​ Combines both labeled and unlabeled data.​

●​ Example: Google Photos automatically tagging people.​

5. Difference Between Supervised and Unsupervised Learning


| Feature | Supervised Learning | Unsupervised Learning |
| --- | --- | --- |
| Definition | Learns from labeled data. | Learns from unlabeled data. |
| Goal | Predicts outcomes. | Finds patterns in data. |
| Example | Spam detection. | Clustering customers. |
| Algorithms | Decision Trees, SVM, Neural Networks. | K-Means, PCA, Autoencoders. |

6. Elements of Data Science

1.​ Data Collection – Gathering raw data from various sources.​


2.​ Data Cleaning – Removing missing values and inconsistencies.​

3.​ Exploratory Data Analysis (EDA) – Understanding data distribution.​

4.​ Feature Engineering – Selecting and transforming features.​

5.​ Model Building – Using ML algorithms for predictions.​

6.​ Evaluation & Deployment – Checking performance and deploying models.​

7. Data Visualization Techniques

Data visualization helps interpret complex datasets using graphical representation.

Common Techniques:

1.​ Bar Chart – Compares categories.​

2.​ Histogram – Shows frequency distribution.​

3.​ Box Plot – Displays distribution and outliers.​

4.​ Scatter Plot – Shows relationships between variables.​

5.​ Heatmap – Represents data density using colors.​

PREVIOUS MID SEM PAPER


1. What is AI? List out the types of AI and Explain them in detail. (3 Marks)

Artificial Intelligence (AI) refers to the simulation of human intelligence in machines that can
perform tasks that typically require human intelligence, such as problem-solving,
decision-making, learning, and understanding language.

Types of AI
AI is classified into the following types:

1.​ Based on Capability​

○​ Narrow AI (Weak AI): Designed for specific tasks (e.g., chatbots, recommendation systems).​

○​ General AI (Strong AI): Machines that can perform any intellectual task like
humans (still theoretical).​

○​ Super AI: Hypothetical AI surpassing human intelligence in all aspects.​

2.​ Based on Functionality​

○​ Reactive Machines: No memory, only react to situations (e.g., IBM's Deep Blue).​

○​ Limited Memory: Can use past data for decision-making (e.g., self-driving cars).​

○​ Theory of Mind AI: Can understand emotions and thoughts (under research).​

○​ Self-Aware AI: AI with its own consciousness (hypothetical).​

2. Differentiate Artificial Intelligence and Machine Learning. (3 Marks)


| Feature | Artificial Intelligence (AI) | Machine Learning (ML) |
| --- | --- | --- |
| Definition | AI is a broad field that enables machines to mimic human intelligence. | ML is a subset of AI that allows machines to learn from data without explicit programming. |
| Purpose | Decision-making and problem-solving | Learning from patterns in data |
| Techniques Used | Includes ML, deep learning, expert systems, etc. | Includes supervised, unsupervised, and reinforcement learning |
| Example | Chatbots, Robotics, Self-driving cars | Recommendation systems, Fraud detection |

3. Roll two dice and observe two numbers X and Y. (3 Marks)

The sample space for rolling two dice contains 6 × 6 = 36 possible outcomes.

(a) Find P(X=2, Y=6)

Only one outcome satisfies this condition: (2, 6).

P(X=2, Y=6) = \frac{1}{36}

(b) Find P(X>3 | Y=2)

Given that Y=2, the possible values for X are {1, 2, 3, 4, 5, 6}.
Favorable cases for X>3 are {4, 5, 6}, which are 3 cases.
Total cases where Y=2 are 6.

P(X>3 \mid Y=2) = \frac{3}{6} = \frac{1}{2}

4. Discuss Poisson Distribution. (3 Marks)

Poisson Distribution models the probability of a given number of events occurring in a fixed
interval of time or space, assuming the events occur independently and at a constant rate.

Formula:

P(X = k) = \frac{e^{-\lambda} \lambda^k}{k!}

Where:

●​ k = number of occurrences

●​ λ = average rate of occurrence

●​ e = Euler’s number (≈ 2.718)

Example Application: Number of customer arrivals at a bank per minute.
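A minimal sketch of this example with SciPy, assuming an arrival rate of λ = 3 customers per minute:

```python
# Poisson sketch with SciPy: P(X = k) for an assumed rate of lambda = 3 arrivals per minute.
from scipy.stats import poisson

lam = 3  # average customer arrivals per minute (assumption for the example)
for k in range(6):
    print(k, round(poisson.pmf(k, lam), 4))   # e^{-lam} * lam^k / k!

print("P(X <= 2):", round(poisson.cdf(2, lam), 4))
```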

5. List down applications of AI and Explain one in detail. (2 Marks)

Applications of AI:

●​ Healthcare (Diagnosis, Medical Imaging, Drug Discovery)​

●​ Finance (Fraud Detection, Algorithmic Trading)​


●​ E-commerce (Personalized Recommendations, Chatbots)​

●​ Automobile (Self-Driving Cars, Traffic Management)​

●​ Education (Automated Grading, Smart Tutors)​

Detailed Explanation: AI in Healthcare​


AI helps in diagnosing diseases using image analysis (e.g., detecting tumors in MRI scans). It
also assists in predicting disease outbreaks and developing drugs faster. AI-powered chatbots
provide preliminary medical advice, reducing the burden on healthcare professionals.

Q.2 (A) Discuss the difference between descriptive and inferential statistics. (3 Marks)

| Aspect | Descriptive Statistics | Inferential Statistics |
| --- | --- | --- |
| Purpose | Summarizes data | Makes predictions or generalizations about a population |
| Techniques | Measures of central tendency (mean, median, mode), dispersion (variance, standard deviation) | Hypothesis testing, confidence intervals, regression analysis |
| Example | Average test scores of a class | Predicting exam performance based on a sample |

OR

Q.2 (A) State the central limit theorem.

The Central Limit Theorem (CLT) states that, regardless of the population distribution, the distribution of the sample mean will approach a normal distribution as the sample size increases (typically n > 30).

\text{If } X_1, X_2, ..., X_n \text{ are i.i.d. with mean } \mu \text{ and variance } \sigma^2, \text{ then } \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \approx N(0,1) \text{ as } n \to \infty.

Q.2 (B) Define Exploratory Data Analysis and explain its importance in data
analysis.
Exploratory Data Analysis (EDA) is a statistical approach used to analyze datasets to
summarize key characteristics, identify patterns, and detect anomalies before applying machine
learning models.

Importance:

●​ Helps in data cleaning and preprocessing​

●​ Identifies missing values and outliers​

●​ Helps understand relationships between variables​

●​ Provides insights for feature engineering​

OR

Q.2 (B) Define the Range and explain its calculation methods.

Definition:​
The Range is the difference between the maximum and minimum values in a dataset.

Formula:

Range = Max Value − Min Value

Example Calculation:​
For dataset {10, 22, 45, 68, 92},

Range = 92 − 10 = 82

Q.2 (C) Calculate the quartiles and find the interquartile range (IQR) for the
given dataset.

Dataset:​
18, 34, 68, 22, 10, 92, 46, 52, 38, 29, 45, 37, 10, 30, 50, 70, 90

Step 1: Arrange in Ascending Order

10, 10, 18, 22, 29, 30, 34, 37, 38, 45, 46, 50, 52, 68, 70, 90, 92

Step 2: Calculate Quartiles

●​ Q1 (First Quartile, 25th Percentile): Median of the lower half (including the overall median): 10, 10, 18, 22, 29, 30, 34, 37, 38 → Q1 = 29

●​ Q2 (Median, 50th Percentile): Middle (9th) value → Q2 = 38

●​ Q3 (Third Quartile, 75th Percentile): Median of the upper half (including the overall median): 38, 45, 46, 50, 52, 68, 70, 90, 92 → Q3 = 52

Step 3: Calculate IQR

IQR = Q3 − Q1 = 52 − 29 = 23
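A quick check of these values with NumPy; note that quartile conventions vary slightly between textbooks, but NumPy's default (linear interpolation) agrees with the median-inclusive method used above:

```python
# Verifying the quartiles with NumPy's default percentile method.
import numpy as np

data = [18, 34, 68, 22, 10, 92, 46, 52, 38, 29, 45, 37, 10, 30, 50, 70, 90]
q1, q2, q3 = np.percentile(data, [25, 50, 75])
print(q1, q2, q3, "IQR =", q3 - q1)   # 29.0 38.0 52.0 IQR = 23.0
```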


Q.2 (C) Find the mean, median, mode, and standard deviation of the given
weights.

Given Data:

x_1 = 3.5, x_2 = 12.3, x_3 = 17.7, x_4 = 20.9, x_5 = 23.1 (kg)

1. Mean (Average):

\text{Mean} = \frac{\sum x_i}{n} = \frac{3.5 + 12.3 + 17.7 + 20.9 + 23.1}{5} = \frac{77.5}{5} = 15.5 \text{ kg}

2. Median (Middle Value):

Since we have 5 values (odd number), the median is the middle value:

\text{Median} = 17.7 \text{ kg}

3. Mode (Most Frequent Value):

Since all values are unique, there is no mode.

4. Standard Deviation (σ):

\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}

First, calculate the squared deviations from the mean (\bar{x} = 15.5):

(3.5 − 15.5)² = (−12)² = 144
(12.3 − 15.5)² = (−3.2)² = 10.24
(17.7 − 15.5)² = (2.2)² = 4.84
(20.9 − 15.5)² = (5.4)² = 29.16
(23.1 − 15.5)² = (7.6)² = 57.76

\sigma = \sqrt{\frac{144 + 10.24 + 4.84 + 29.16 + 57.76}{5}} = \sqrt{\frac{246}{5}} = \sqrt{49.2} \approx 7.01 \text{ kg}
Q.3 (A) Explain the difference between classification and regression models.

| Feature | Classification | Regression |
| --- | --- | --- |
| Definition | Assigns labels to data (e.g., cat vs. dog) | Predicts continuous values (e.g., temperature) |
| Output | Discrete values (e.g., 0 or 1) | Continuous values (e.g., 45.6°C) |
| Example | Spam detection (spam/not spam) | Predicting house prices |

Q.3 (B) Explain Linear Regression with an example.

Linear Regression is a statistical method to predict a continuous variable based on the relationship between independent (X) and dependent (Y) variables. The equation is:

Y = mX + c

Where:

●​ Y = dependent variable

●​ X = independent variable

●​ m = slope

●​ c = intercept

Example:​
Predicting house prices based on size (sq ft). If:

\text{Price} = 5000 \times \text{Size} + 20000

Then for a house of 1000 sq ft:

\text{Price} = (5000 \times 1000) + 20000 = 5{,}020{,}000

Q.3 (C) Explain the Decision Tree Algorithm with an example.

A Decision Tree is a tree-like model used for classification and regression. It splits data into
branches based on feature values.
Example:​
For predicting whether a student will pass an exam:

●​ If study hours > 3, then pass​

●​ If study hours ≤ 3, then fail​

Q.3 (A) Explain Polynomial Regression.

Polynomial Regression is a type of regression where the relationship between independent and dependent variables is modeled as an nth-degree polynomial:

Y = a_0 + a_1X + a_2X^2 + ... + a_nX^n

Used when data follows a curved pattern rather than a straight line.

Example: Predicting population growth using a quadratic equation.

Q.3 (B) Discuss the concept of ensemble learning and how it is utilized in
random forests.

Ensemble Learning combines multiple models to improve accuracy.

Random Forest is an ensemble of multiple Decision Trees. Each tree is trained on a random
subset of the data, and the final prediction is based on majority voting (classification) or
averaging (regression).

Advantages:

●​ Reduces overfitting​

●​ Improves accuracy​

Q.3 (C) Explain the concept of Support Vector Machine with an example.

A Support Vector Machine (SVM) is a supervised learning algorithm that finds the best
decision boundary (hyperplane) to classify data.
Example:​
For classifying emails as spam or not spam, SVM finds the best boundary between the two
categories.

Q.4 (Attempt any 4 out of 6, Each Question of 3 Marks)

(1) Which Evaluation Metrics do we use for the Classification Problem? Explain any three.

1.​ Accuracy – Percentage of correctly classified instances.​

2.​ Precision – Ratio of true positives to total predicted positives.​

3.​ Recall – Ratio of true positives to actual positives.​
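A minimal sketch computing all three metrics with scikit-learn; the toy labels and predictions are assumptions for the example:

```python
# Accuracy, precision, and recall from scikit-learn (toy predictions assumed).
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("Accuracy:", accuracy_score(y_true, y_pred))    # correct / total
print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall:", recall_score(y_true, y_pred))        # TP / (TP + FN)
```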

(2) Differentiate between supervised and unsupervised learning.


| Feature | Supervised Learning | Unsupervised Learning |
| --- | --- | --- |
| Labeled Data | Uses labeled data | Uses unlabeled data |
| Purpose | Classification & regression | Clustering & pattern discovery |
| Example | Spam detection | Customer segmentation |

(3) Define the K-Means algorithm.

K-Means is a clustering algorithm that partitions data into K clusters based on feature similarity.
It minimizes the variance within each cluster.

Steps:

1.​ Select K cluster centers.​

2.​ Assign each point to the nearest cluster.​

3.​ Update the cluster centers and repeat until convergence.​

(4) Challenges and advantages of unsupervised learning compared to supervised learning.

Advantages:
●​ No need for labeled data​

●​ Identifies hidden patterns​

●​ Useful for exploratory analysis​

Challenges:

●​ Hard to evaluate results​

●​ May group unrelated data​

●​ Needs fine-tuning​

(5) What is Dimensionality Reduction? List the methods to reduce dimensions.

Dimensionality Reduction reduces the number of features while preserving essential information.

Methods:

1.​ Principal Component Analysis (PCA)​

2.​ t-SNE (t-Distributed Stochastic Neighbor Embedding)​

3.​ Autoencoders​

(6) What is a clustering method? List down the types of clustering.

Clustering is an unsupervised learning technique that groups similar data points together.

Types of Clustering:

1.​ Partitioning-based (K-Means)​

2.​ Hierarchical (Agglomerative, Divisive)​

3.​ Density-based (DBSCAN)​

4.​ Fuzzy Clustering (Fuzzy C-Means)​

