AIML
UNIT:- 5
Unsupervised Learning
Unsupervised learning is a type of machine learning where the model is trained on unlabeled
data. Unlike supervised learning, where the model learns from labeled data, unsupervised
learning finds patterns, relationships, or structures in data without explicit guidance.
Challenges of Unsupervised Learning:
1. Difficult to Interpret – Since there are no predefined labels, understanding and
interpreting the results can be challenging.
2. Overfitting – Without labeled data, models may overfit to noise rather than learning
meaningful patterns.
K-Means Clustering
K-Means is an unsupervised clustering algorithm used to group data into K clusters. It works
by:
1. Choosing K initial centroids (often at random).
2. Assigning each data point to its nearest centroid.
3. Recomputing each centroid as the mean of the points assigned to it.
4. Repeating steps 2-3 until centroids stop changing or a maximum number of iterations is
reached.
Use Cases:
● Customer segmentation
● Anomaly detection
● Image compression
Challenges:
● Sensitive to outliers.
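The steps above can be sketched with scikit-learn's KMeans; the two-blob data here is a toy example, and K=2 is chosen to match it:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: two well-separated blobs of 2-D points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(20, 2)),
               rng.normal(5, 0.5, size=(20, 2))])

# Fit K-Means with K=2; n_init restarts from several random centroid seeds
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # one centroid near (0, 0), one near (5, 5)
print(km.labels_[:5])       # cluster index assigned to the first 5 points
```

In practice K is unknown and must be tuned (e.g., with the elbow method), which is exactly the fine-tuning challenge noted above.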
Principal Component Analysis (PCA)
PCA is an unsupervised dimensionality-reduction technique that projects data onto the
directions (principal components) of maximum variance.
Steps of PCA:
1. Standardize the data.
2. Compute the covariance matrix.
3. Compute the eigenvalues and eigenvectors of the covariance matrix.
4. Sort the eigenvectors by eigenvalue and keep the top k as principal components.
5. Project the data onto these components.
Applications:
● Image compression
● Noise reduction
● Feature extraction
Advantages:
● Reduces computational cost.
Disadvantages:
● Loss of interpretability.
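A minimal PCA sketch with scikit-learn; the three correlated features are synthetic, and the 95% variance threshold is an illustrative choice:

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data: 3 features that are nearly linear combinations of one hidden factor
rng = np.random.default_rng(1)
t = rng.normal(size=(100, 1))
X = np.hstack([t,
               2 * t + rng.normal(0, 0.1, size=(100, 1)),
               -t + rng.normal(0, 0.1, size=(100, 1))])

# Keep as many components as needed to explain 95% of the variance
pca = PCA(n_components=0.95).fit(X)
X_reduced = pca.transform(X)
print(X_reduced.shape)                 # one component suffices here
print(pca.explained_variance_ratio_)  # share of variance per kept component
```

This also illustrates the interpretability cost: the single retained component is a mix of all three original features.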
Common Python Libraries for AI/ML:
1. NumPy – Provides support for large multidimensional arrays and numerical
computations.
4. Scikit-learn – Provides simple and efficient tools for machine learning, including
classification, regression, and clustering.
5. TensorFlow & PyTorch – Used for deep learning and neural networks.
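The NumPy point above can be illustrated with a short snippet (the array values are arbitrary):

```python
import numpy as np

# A 2x3 array; NumPy operations are vectorized (no Python loops needed)
arr = np.array([[1, 2, 3],
                [4, 5, 6]])
print(arr.shape)        # (2, 3)
print(arr.mean())       # 3.5
print(arr.sum(axis=0))  # column sums: [5 7 9]
```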
Supervised Learning
Supervised learning is a type of machine learning where a model is trained on labeled data. The
algorithm learns from input-output pairs and makes predictions on new data.
Problems in Supervised Learning
1. Overfitting – The model may learn noise instead of actual patterns.
2. Bias in Data – If the training data is biased, the model may make incorrect predictions.
3. Limited Generalization – The model may not perform well on unseen data.
Linear Regression
Linear Regression is a regression algorithm that models the relationship between independent
(X) and dependent (Y) variables using a straight line:
Y = mX + b
where:
● m = slope (coefficient)
● b = intercept
Applications:
● House price prediction
● Sales forecasting
Limitations:
● Sensitive to outliers.
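A minimal sketch with scikit-learn's LinearRegression; the data is generated from Y = 3X + 2 plus noise, so the learned slope and intercept should land near those values:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data following Y = 3X + 2 with a little noise
rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(50, 1))
y = 3 * X[:, 0] + 2 + rng.normal(0, 0.1, size=50)

model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)  # close to m = 3 and b = 2
```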
Logistic Regression
Logistic Regression is a classification algorithm that models the probability of a binary
outcome using the sigmoid function.
Applications:
● Spam detection
● Medical diagnosis
Advantages:
● Outputs probabilities.
Disadvantages:
● Sensitive to outliers.
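A minimal sketch with scikit-learn's LogisticRegression; the 1-D data and the class boundary at x = 5 are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy problem: label is 1 whenever the single feature exceeds 5
rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(100, 1))
y = (X[:, 0] > 5).astype(int)

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[9.0]]))  # class probabilities for x = 9
print(clf.predict([[9.0]]))       # predicted class label
```

Note that `predict_proba` returns calibrated-looking probabilities, which is the "outputs probabilities" advantage listed above.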
Polynomial Regression
Polynomial Regression is an extension of Linear Regression used when the relationship
between variables is non-linear. It fits a polynomial equation:
Y = b₀ + b₁X + b₂X² + ... + bₙXⁿ
Applications:
● Weather prediction
Advantages:
● Captures non-linear relationships.
Disadvantages:
● High-degree polynomials can overfit the data.
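A polynomial fit can be sketched with NumPy's polyfit; the quadratic y = 2x² + 3x + 1 is an illustrative example:

```python
import numpy as np

# Toy data generated from y = 2x^2 + 3x + 1 (coefficients are illustrative)
x = np.linspace(-5, 5, 50)
y = 2 * x**2 + 3 * x + 1

# Fit a degree-2 polynomial; coefficients come back highest power first
coeffs = np.polyfit(x, y, deg=2)
print(np.round(coeffs, 3))  # recovers [2, 3, 1]
```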
Decision Tree
A Decision Tree is a tree-like model used for both classification and regression. It splits data
based on feature conditions.
How It Works:
1. Select the best feature to split data (using Gini impurity or entropy).
2. Split the dataset into branches based on that feature's values.
3. Repeat recursively until a stopping condition is met (e.g., pure leaves or maximum depth).
Advantages:
● Easy to understand.
Disadvantages:
● Prone to overfitting.
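A minimal sketch with scikit-learn's DecisionTreeClassifier on the classic Iris dataset; capping `max_depth` is one common way to curb the overfitting noted above:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# A shallow tree: max_depth=3 limits how far the recursive splitting goes
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(iris.data, iris.target)
print(tree.score(iris.data, iris.target))  # training accuracy
```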
Random Forest
Random Forest is an ensemble learning method that uses multiple decision trees to improve
accuracy.
How It Works:
1. Train many decision trees, each on a random bootstrap sample of the data (often with a
random subset of features).
2. Combine the outputs using majority voting (for classification) or averaging (for
regression).
Advantages:
● Reduces overfitting.
Disadvantages:
● Computationally expensive.
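A minimal sketch with scikit-learn's RandomForestClassifier; the Iris dataset and 100 trees are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(iris.data, iris.target, random_state=0)

# 100 trees, each trained on a bootstrap sample; prediction is the majority vote
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(rf.score(X_te, y_te))  # accuracy on the held-out split
```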
Naïve Bayes
Naïve Bayes is a probabilistic classifier based on Bayes' Theorem, with the "naïve"
assumption that features are independent of each other:
P(A|B) = P(B|A) · P(A) / P(B)
Applications:
● Spam filtering
● Sentiment analysis
Advantages:
● Fast to train and works well on high-dimensional data such as text.
Disadvantages:
● The feature-independence assumption rarely holds exactly.
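A minimal spam-filtering sketch with scikit-learn's MultinomialNB; the six documents and their labels are entirely made up:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up corpus (labels: 1 = spam, 0 = not spam)
texts = ["win money now", "cheap money offer", "win a free offer",
         "meeting at noon", "lunch tomorrow", "project meeting notes"]
labels = [1, 1, 1, 0, 0, 0]

vec = CountVectorizer()
X = vec.fit_transform(texts)          # word-count feature matrix
clf = MultinomialNB().fit(X, labels)  # learns smoothed P(word | class)

print(clf.predict(vec.transform(["free money offer"])))
```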
Support Vector Machine (SVM)
SVM is a classification algorithm that finds the optimal hyperplane to separate data points.
Key Concepts:
● Hyperplane: The decision boundary that separates the classes.
● Margin: The distance between the hyperplane and the closest points.
● Support Vectors: Those closest points, which determine the position of the hyperplane.
Applications:
● Image classification
● Text categorization
Advantages:
● Effective in high-dimensional spaces.
Disadvantages:
● Computationally expensive for large datasets.
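A minimal sketch with scikit-learn's SVC; the two blobs are generated toy data, and a linear kernel is used since they are linearly separable:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two separable blobs; a linear SVC finds the maximum-margin hyperplane
X, y = make_blobs(n_samples=40, centers=2, random_state=6)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print(clf.score(X, y))            # training accuracy
print(len(clf.support_vectors_))  # only a few points define the margin
```

The support vectors are the points closest to the boundary, matching the margin concept above.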
UNIT :- 3
Types of Probability:
1. Classical Probability – Assumes all outcomes are equally likely (e.g., rolling a fair die).
2. Conditional Probability – The probability of event A occurring given that event B has
already happened.
1. Estimation
Random Variables
A random variable assigns a numerical value to each outcome of a random experiment.
Types:
1. Discrete Random Variable – Takes countable values (e.g., number of heads in coin
flips).
2. Continuous Random Variable – Takes infinite values within a range (e.g.,
temperature).
Central Limit Theorem (CLT) and Its Rules
The Central Limit Theorem (CLT) states that the distribution of the sample mean approaches a
normal distribution as the sample size increases, regardless of the population distribution.
Rules of CLT:
1. The sample size should be sufficiently large (n ≥ 30).
2. The population can be of any distribution, but the sample mean will be approximately
normal.
3. The mean of the sample distribution equals the population mean (μ).
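The CLT can be demonstrated with a small NumPy simulation; the exponential distribution is just an example of a skewed, non-normal population:

```python
import numpy as np

# Means of n=30 draws from a skewed Exponential(1) population are roughly
# normal with mean mu = 1 and spread sigma / sqrt(n) = 1 / sqrt(30)
rng = np.random.default_rng(0)
n = 30
sample_means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)

print(round(sample_means.mean(), 3))  # close to the population mean 1.0
print(round(sample_means.std(), 3))   # close to 1/sqrt(30) ≈ 0.183
```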
Cross-Validation
Cross-validation is a technique used to evaluate machine learning models by splitting data into
training and testing sets multiple times.
Types:
1. K-Fold Cross-Validation – Splits data into K subsets and trains the model K times.
2. Leave-One-Out Cross-Validation (LOO-CV) – Each observation is used as a test set
while the rest form the training set.
3. Stratified K-Fold – Ensures each fold has the same class proportion.
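K-fold cross-validation can be sketched with scikit-learn's `cross_val_score`; the Iris dataset and logistic regression model are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

iris = load_iris()
model = LogisticRegression(max_iter=1000)

# 5-fold CV: each fold serves once as the test set
# (scikit-learn uses stratified folds for classifiers by default)
scores = cross_val_score(model, iris.data, iris.target, cv=5)
print(scores)         # one accuracy score per fold
print(scores.mean())  # overall estimate of generalization accuracy
```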
Bayes' Theorem
Bayes' Theorem describes the probability of an event based on prior knowledge of related
conditions.
Importance:
Exploratory Data Analysis (EDA) is the process of analyzing datasets to summarize their main
characteristics, often using visualizations and statistical techniques.
Steps in EDA:
1. Understanding Data – Checking data types, missing values, and distributions.
2. Summary Statistics – Computing measures like mean, median, and standard deviation.
Importance:
Descriptive Statistics
The 3Ms – Mean, Median, and Mode – are measures of central tendency that describe the "center" of data.
Measure of Dispersion
Measures of dispersion describe how spread out the data values are around the center.
Types:
● Range
● Variance
● Standard Deviation
● Interquartile Range (IQR)
Importance:
● Two datasets can share the same mean yet differ greatly in spread; dispersion measures
capture this difference.
UNIT:-1
Applications of AI:
1. Healthcare – AI diagnoses diseases, predicts patient outcomes, and assists in drug
discovery.
2. Self-driving Cars – AI powers autonomous vehicles by recognizing objects and making
driving decisions.
3. Chatbots & Virtual Assistants – Used in customer service (e.g., Siri, Alexa).
2. Problem Characteristics of AI
AI problems have unique characteristics that determine how they are solved.
1. Decomposability – Can the problem be broken down into smaller subproblems?
2. Ignorability of Steps – Do previous steps matter for the final solution?
3. Solution Type – Is the best solution absolute (fixed) or relative (depends on the
scenario)?
4. State vs. Path Solution – Does solving the problem require a final state or a sequence
of steps?
5. Role of Knowledge – Does solving the problem require domain knowledge?
✅ Analysis of AI Problems:
Problem | Decomposable? | Ignore Steps? | Solution Type | State or Path? | Role of Knowledge?
Flexibility | Can be trained for specific tasks. | Can learn new tasks without retraining.
4. Types of Learning in AI
1. Supervised Learning
● Uses labeled data.
2. Unsupervised Learning
● Finds patterns in unlabeled data.
3. Reinforcement Learning
● Learns through trial and error, guided by rewards and penalties.
4. Semi-Supervised Learning
● Combines a small amount of labeled data with a large amount of unlabeled data.
Common Techniques:
PREVIOUS MID SEM PAPER
1. What is AI? List out the types of AI and Explain them in detail. (3 Marks)
Artificial Intelligence (AI) refers to the simulation of human intelligence in machines that can
perform tasks that typically require human intelligence, such as problem-solving,
decision-making, learning, and understanding language.
Types of AI
AI is classified into the following types:
○ Narrow AI (Weak AI): Machines designed for one specific task (e.g., spam
filters, recommendation systems).
○ General AI (Strong AI): Machines that can perform any intellectual task like
humans (still theoretical).
○ Reactive Machines: No memory, only react to situations (e.g., IBM's Deep Blue).
○ Limited Memory: Can use past data for decision-making (e.g., self-driving cars).
○ Theory of Mind AI: Can understand emotions and thoughts (under research).
The sample space for rolling two dice contains 6 × 6 = 36 possible outcomes.
(a) Find P(X = 2, Y = 6)
Since each of the 36 equally likely outcomes corresponds to one (X, Y) pair,
P(X = 2, Y = 6) = 1/36.
Poisson Distribution models the probability of a given number of events occurring in a fixed
interval of time or space, assuming the events occur independently and at a constant rate.
Formula:
P(X = k) = (λ^k · e^(−λ)) / k!
Where:
● λ = average number of events in the interval
● k = number of occurrences
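The Poisson PMF translates directly into code; the rate λ = 3 below is just an illustrative value:

```python
import math

def poisson_pmf(k, lam):
    # P(X = k) = lam**k * exp(-lam) / k!
    return lam ** k * math.exp(-lam) / math.factorial(k)

# e.g., an average of 3 events per interval: probability of exactly 2 events
print(round(poisson_pmf(2, 3.0), 4))  # 0.224
```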
Applications of AI:
OR
The Central Limit Theorem (CLT) states that, regardless of the population distribution, the
distribution of the sample mean will approach a normal distribution as the sample size increases
(typically n > 30).
If X₁, X₂, ..., Xₙ are i.i.d. with mean μ and variance σ², then (X̄ − μ) / (σ/√n) ≈ N(0, 1)
as n → ∞.
Q.2 (B) Define Exploratory Data Analysis and explain its importance in data
analysis.
Exploratory Data Analysis (EDA) is a statistical approach used to analyze datasets to
summarize key characteristics, identify patterns, and detect anomalies before applying machine
learning models.
Importance:
● Reveals missing values, outliers, and anomalies before modeling.
● Guides feature selection and the choice of model.
OR
Q.2 (B) Define the Range and explain its calculation methods.
Definition:
The Range is the difference between the maximum and minimum values in a dataset.
Formula:
Range = Maximum value − Minimum value
Example Calculation:
For dataset {10, 22, 45, 68, 92}:
Range = 92 − 10 = 82
Q.2 (C) Calculate the quartiles and find the interquartile range (IQR) for the
given dataset.
Dataset:
18, 34, 68, 22, 10, 92, 46, 52, 38, 29, 45, 37, 10, 30, 50, 70, 90
Sorted in ascending order: 10, 10, 18, 22, 29, 30, 34, 37, 38, 45, 46, 50, 52, 68, 70, 90, 92
● Q1 (First Quartile, 25th Percentile): Median of the first half (including the overall median) → Q1 = 29
● Q2 (Median, 50th Percentile): Middle (9th) value → Q2 = 38
● Q3 (Third Quartile, 75th Percentile): Median of the second half (including the overall median) → Q3 = 52
● IQR = Q3 − Q1 = 52 − 29 = 23
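Quartile conventions differ slightly between textbooks; under NumPy's default linear-interpolation convention this dataset gives Q1 = 29, Q2 = 38, Q3 = 52:

```python
import numpy as np

data = [18, 34, 68, 22, 10, 92, 46, 52, 38, 29, 45, 37, 10, 30, 50, 70, 90]

# np.percentile sorts internally; the default method is linear interpolation
q1, q2, q3 = np.percentile(data, [25, 50, 75])
print(q1, q2, q3, q3 - q1)  # quartiles followed by the IQR
```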
Q.2 (C) Find the mean, median, mode, and standard deviation of the given
weights.
Given Data:
x₁ = 3.5, x₂ = 12.3, x₃ = 17.7, x₄ = 20.9, x₅ = 23.1
1. Mean (Average):
Mean = (Σxᵢ)/n = (3.5 + 12.3 + 17.7 + 20.9 + 23.1)/5 = 77.5/5 = 15.5 kg
2. Median:
Since we have 5 values (odd number), the median is the middle value: Median = 17.7 kg
3. Mode:
No value repeats, so this dataset has no mode.
4. Standard Deviation (population):
σ = √(Σ(xᵢ − x̄)²/n) = √(246.0/5) = √49.2 ≈ 7.01 kg
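These statistics can be verified with NumPy (note that every weight appears exactly once, so there is no unique mode):

```python
import numpy as np

weights = [3.5, 12.3, 17.7, 20.9, 23.1]

mean = np.mean(weights)      # 15.5
median = np.median(weights)  # 17.7 (middle of the sorted values)
std = np.std(weights)        # population standard deviation, about 7.01

print(mean, median, round(std, 2))
```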
Definition | Classification: Assigns labels to data (e.g., cat vs. dog) | Regression: Predicts continuous values (e.g., temperature)
Y = mX + c
Where:
● Y = dependent variable
● X = independent variable
● m = slope
● c = intercept
Example:
Predicting house prices based on size (sq ft).
A Decision Tree is a tree-like model used for classification and regression. It splits data into
branches based on feature values.
Example:
For predicting whether a student will pass an exam, the tree might first split on hours
studied, then on attendance.
Polynomial Regression: Used when data follows a curved pattern rather than a straight line.
Q.3 (B) Discuss the concept of ensemble learning and how it is utilized in
random forests.
Random Forest is an ensemble of multiple Decision Trees. Each tree is trained on a random
subset of the data, and the final prediction is based on majority voting (classification) or
averaging (regression).
Advantages:
● Reduces overfitting
● Improves accuracy
Q.3 (C) Explain the concept of Support Vector Machine with an example.
A Support Vector Machine (SVM) is a supervised learning algorithm that finds the best
decision boundary (hyperplane) to classify data.
Example:
For classifying emails as spam or not spam, SVM finds the best boundary between the two
categories.
(1) Which Evaluation Metrics do we use for the Classification Problem? Explain any
three.
Common classification metrics include:
1. Accuracy – the fraction of all predictions that are correct.
2. Precision – TP / (TP + FP): of the samples predicted positive, the fraction that are
actually positive.
3. Recall – TP / (TP + FN): of the actual positive samples, the fraction correctly identified.
K-Means is a clustering algorithm that partitions data into K clusters based on feature similarity.
It minimizes the variance within each cluster.
Steps:
Advantages:
● No need for labeled data
Challenges:
● Requires choosing the number of clusters K in advance
● Sensitive to initial centroid placement
Methods:
● Autoencoders
Clustering is an unsupervised learning technique that groups similar data points together.
Types of Clustering:
● Partitioning clustering (e.g., K-Means)
● Hierarchical clustering
● Density-based clustering (e.g., DBSCAN)