ML Lab Exam Document
import numpy as np
from scipy import stats

# Data given in class (the same values are used in the normalization examples below)
data = [3, 5, 5, 8, 9, 12, 12, 13, 15, 16, 17, 19, 22, 24, 25, 134]
data_array = np.array(data)
# MEAN
mean = np.mean(data_array)
# MEDIAN
median = np.median(data_array)
# MODE
mode_result = stats.mode(data_array)
# RANGE (max - min)
range_value = np.ptp(data_array)
# VARIANCE (sample variance, ddof=1)
variance = np.var(data_array, ddof=1)
# STANDARD DEVIATION (sample standard deviation, ddof=1)
std = np.std(data_array, ddof=1)
# Print results
print("Mean:", mean)
print("Median:", median)
print("Mode:", mode_result.mode)
print("Range:", range_value)
print("Variance:", variance)
print("Standard Deviation:", std)
a) MIN-MAX Normalization:
Min-Max Scaling (or Min-Max Normalization) is one of the simplest methods where
the values are scaled to a fixed range — usually 0 to 1. The formula for calculating
the Min-Max Scaling is:
𝑋ₙₒᵣₘ = (𝑋 − 𝑋ₘᵢₙ) / (𝑋ₘₐₓ − 𝑋ₘᵢₙ)
where 𝑋ₘᵢₙ and 𝑋ₘₐₓ are the minimum and maximum values of the feature, respectively.
This method is best used when the distribution is not Gaussian or when the standard
deviation is very small. However, it is sensitive to outliers.
CODE:
(Using Data Given in class)
import pandas as pd
def min_max_normalization(df):
    return (df - df.min()) / (df.max() - df.min())
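A minimal usage sketch, applying the function to the data given in class (a single column 'A', the same values used in the Z-score example further below):
data = pd.DataFrame({'A': [3, 5, 5, 8, 9, 12, 12, 13, 15, 16, 17, 19, 22, 24, 25, 134]})
normalized_data = min_max_normalization(data)
print(normalized_data)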
b) Mean Normalization:
Mean Normalization centres the values on 0 by subtracting the feature mean and then
scales them by the range, so the results fall between −1 and 1:
𝑋ₙₒᵣₘ = (𝑋 − μ) / (𝑋ₘₐₓ − 𝑋ₘᵢₙ)
where μ is the mean of the feature.
CODE:
(Using Data Given in class)
import pandas as pd
def mean_normalization(df):
    return (df - df.mean()) / (df.max() - df.min())
data = pd.DataFrame({'A': [3, 5, 5, 8, 9, 12, 12, 13, 15, 16, 17, 19, 22, 24, 25, 134]})
normalized_data = mean_normalization(data)
print(normalized_data)
c) Z-Score Normalization:
Standardization (or Z-Score Normalization) transforms the features so they have the
properties of a standard normal distribution with a mean of 0 and a standard deviation
of 1:
𝒁 = (𝑋 − μ) / σ
where μ is the mean of the feature and σ is the standard deviation. This method is less
affected by outliers and is suitable for algorithms that assume the input data is
normally distributed.
CODE:
(Using Data Given in class)
def z_score_normalization(df):
    return (df - df.mean()) / df.std()
data = pd.DataFrame({'A':
[3,5,5,8,9,12,12,13,15,16,17,19,22,24,25,134]})
standardized_data = z_score_normalization(data)
print(standardized_data)
d) Robust Scaling:
Robust Scaling uses the median and the interquartile range (IQR) instead of the mean and
standard deviation used in Z-score normalization. The IQR is the difference between the
75th percentile (Q3) and the 25th percentile (Q1) of the data. The 25th percentile is
the value below which 25% of the data falls, and the 75th percentile is the value
below which 75% of the data falls. The IQR thus represents the middle 50% of the
data and is used because it measures the variability in the data while ignoring the
influence of extreme outliers. It subtracts the median from the data points and divides
by the IQR:
𝑋′ = (𝑋 − median) / IQR
This method is robust to outliers and is preferred if the data contains many outliers or
is skewed.
CODE:
(Using Data Given in class)
import pandas as pd
def robust_scaling(df):
    median = df.median()
    iqr = df.quantile(0.75) - df.quantile(0.25)
    return (df - median) / iqr
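A minimal usage sketch on the same class data (single column 'A', as in the examples above):
data = pd.DataFrame({'A': [3, 5, 5, 8, 9, 12, 12, 13, 15, 16, 17, 19, 22, 24, 25, 134]})
scaled_data = robust_scaling(data)
print(scaled_data)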
K-Means Clustering
a) K-Means Clustering using the data given in class and the Iris flower dataset.
K-Means Clustering is an unsupervised machine learning algorithm that groups an
unlabelled dataset into different clusters. It assigns each data point to one of the K
clusters depending on its distance from the cluster centres, starting from randomly
placed centroids and then iteratively updating them.
Advantages of K-means Clustering
1. Simple and Easy to Implement
2. Computationally Efficient
3. Fast Convergence
4. Versatile Applications
5. Produces Tight, Well-Separated Clusters
Disadvantages of K-means Clustering
1. Requires Prior Knowledge of k
2. Sensitive to Initial Centroids
3. Assumes Spherical Clusters of Similar Size
4. Sensitive to Outliers
5. Not Deterministic
6. Limited to Euclidean Distance
Applications of K-means
• Customer Segmentation: Group customers based on purchasing behavior.
• Image Compression: Reduce the number of colors in an image by clustering
pixels.
• Anomaly Detection: Identify outliers by clustering normal data points.
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
from sklearn.cluster import KMeans
df = pd.read_csv(" ")  # dataset path as given in class
df.head()
df.shape
# Fit K-Means on the numeric feature columns (k = 3 clusters assumed here)
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(df.select_dtypes(include=np.number))
labels = kmeans.labels_
df['cluster'] = labels
df.head()
centroids = kmeans.cluster_centers_
centroids
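One of the disadvantages listed above is that k must be known in advance. A common heuristic is the elbow method: fit K-Means for several values of k and plot the inertia (within-cluster sum of squares). This is a minimal sketch under the assumption that the numeric columns of df are the features; it is not part of the original lab code.
import matplotlib.pyplot as plt

X = df.select_dtypes(include=np.number)   # numeric features only (assumption)
inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(X)
    inertias.append(km.inertia_)          # within-cluster sum of squares

plt.plot(range(1, 11), inertias, marker='o')
plt.xlabel("Number of clusters k")
plt.ylabel("Inertia")
plt.title("Elbow method for choosing k")
plt.show()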
KNN Classifier
K-Nearest Neighbors (KNN) is a simple, non-parametric, and instance-based algorithm
used for classification (and regression). The KNN algorithm classifies a new data point
based on the majority label of the k nearest data points in the feature space. It relies
on a distance metric (commonly Euclidean) to identify the closest neighbors.
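To make this concrete, here is a small from-scratch sketch of the idea just described (Euclidean distance plus a majority vote). The training points, labels, and k value below are made-up illustrations, not the lab data:
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from the new point to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1, 1], [1, 2], [5, 5], [6, 5]])
y_train = np.array(['A', 'A', 'B', 'B'])
print(knn_predict(X_train, y_train, np.array([2, 1])))  # expected: 'A'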
Applications of KNN Classifier
1. Recommendation Systems: KNN is used for collaborative filtering in
recommendation engines by finding users with similar preferences.
2. Image and Video Recognition: Used in pattern recognition tasks such as image
classification, object detection, and video recognition.
3. Medical Diagnosis: KNN is applied in diagnosing diseases by comparing patient
symptoms with historical cases.
4. Finance and Banking: Used in credit rating, loan approval, and fraud detection
based on similarities in financial behavior
Advantages of KNN Classifier
1. Simple to Understand and Implement
2. No Training Phase
3. Adaptability to New Data
4. Versatile in Distance Metrics
5. Effective with Small Datasets
Disadvantages of KNN Classifier
1. Computationally Expensive at Prediction
2. Storage-Intensive
3. Sensitive to Irrelevant Features
4. Sensitive to Feature Scaling
5. Poor Performance with Imbalanced Data
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report
import warnings
warnings.filterwarnings("ignore")
df = pd.read_csv(r"/")
df.head()
df.shape
# Hold out the last row as an unseen test sample
test_df = df.tail(1)
test_df = test_df.drop('Species', axis=1)
test_df.head()
# Remove that row from the training data
df = df.drop(df.index[-1])
df.shape
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
X.shape
y.shape
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
knn_pred = knn.predict(X)
accuracy_score(y, knn_pred)
test_pred = knn.predict(test_df)
print(test_pred)
Continuous Matrix
A continuous matrix is a matrix that contains continuous numerical values (as opposed
to discrete or categorical values). This type of matrix is common in numerical data
representation, where the entries represent measurements, intensities, probabilities, or
other continuous data values. Continuous matrices are widely used in linear algebra,
machine learning, and data science.
Applications of Continuous Matrices
1. Image and Signal Processing: Represent pixel intensities or signal amplitudes in
a structured format.
2. Machine Learning and Data Science: Used as inputs for models, especially in
regression, clustering, and neural networks.
3. Physics and Engineering: Model physical systems and solve differential
equations in areas like mechanics and electromagnetism.
4. Economics and Finance: Analyze time series, forecast trends, and model market
behaviors.
Advantages of Continuous Matrices
1. Supports Complex Calculations
2. High Precision in Representing Data
3. Facilitates Machine Learning Models
4. Smooth Interpolations
5. Enables Statistical Analysis
Disadvantages of Continuous Matrices
1. Sensitive to Noise
2. Memory Intensive
3. Requires Preprocessing
4. Complexity in Interpretation
5. Computationally Intensive
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
import warnings
warnings.filterwarnings('ignore')
cm = pd.read_csv(" ")  # confusion matrix given in class
cm
confusion_matrix = cm.to_numpy()   # assumes the CSV holds only the counts
num_classes = confusion_matrix.shape[0]
total = confusion_matrix.sum()
for i in range(num_classes):
    TP = confusion_matrix[i, i]                 # true positives for class i
    FN = confusion_matrix[i, :].sum() - TP      # false negatives (row minus TP)
    FP = confusion_matrix[:, i].sum() - TP      # false positives (column minus TP)
    TN = total - (TP + FP + FN)                 # everything else
    print(f"Class {i}: TP={TP}, FN={FN}, FP={FP}, TN={TN}")
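The per-class counts above can be turned into the usual metrics. A minimal follow-up sketch, computed by hand from TP, FP and FN (so it does not depend on the sklearn helpers imported at the top):
for i in range(num_classes):
    TP = confusion_matrix[i, i]
    FP = confusion_matrix[:, i].sum() - TP
    FN = confusion_matrix[i, :].sum() - TP
    precision = TP / (TP + FP) if (TP + FP) else 0.0
    recall = TP / (TP + FN) if (TP + FN) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    print(f"Class {i}: precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}")
print("Overall accuracy:", np.trace(confusion_matrix) / total)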
Decision Tree
A decision tree is a supervised machine learning algorithm used for classification and
regression tasks. It works by recursively splitting the data into subsets based on feature
values, creating a tree-like model of decisions. Each internal node represents a decision
on a feature, each branch represents an outcome of the decision, and each leaf node
represents a final prediction or outcome.
Applications of Decision Trees
1. Medical Diagnosis: Used to assist in medical decision-making by classifying
patients based on symptoms and test results.
2. Customer Segmentation: Helps in dividing customers into distinct groups for
targeted marketing.
3. Credit Scoring: Evaluates loan applicants' risk based on financial behavior and
credit history.
4. Fraud Detection: Identifies suspicious patterns in transactions that may indicate
fraud
Advantages of Decision Trees
1. Easy to Interpret and Visualize
2. Requires Minimal Data Preparation
3. Handles Both Numerical and Categorical Data
4. Non-Parametric Model
5. Fast and Efficient for Small to Medium Datasets
Disadvantages of Decision Trees
1. Prone to Overfitting
2. Unstable with Small Variations in Data
3. Biased Towards Features with More Levels
4. Less Accurate Than Ensemble Methods
5. Limited to Axis-Aligned Splits
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt
df = pd.read_csv(" ")  # dataset path as given in class
print("Dataset Head:")
print(df.head())
# Assume the last column is the class label (adjust to the class dataset)
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
# One-hot encode any categorical feature columns
X = pd.get_dummies(X, drop_first=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = DecisionTreeClassifier(criterion='entropy')
model.fit(X_train, y_train)
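Since plot_tree and matplotlib are already imported above, a short follow-up sketch for evaluating and visualising the fitted tree (the feature and class names come from the X and y assumed above):
print("Test accuracy:", model.score(X_test, y_test))

plt.figure(figsize=(12, 6))
plot_tree(model, feature_names=list(X.columns),
          class_names=[str(c) for c in model.classes_], filled=True)
plt.show()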
SVM Linear
A Linear Support Vector Machine (SVM) is a supervised machine learning algorithm
used for classification and regression. It finds the best hyperplane that separates data
points of different classes with the maximum margin. This hyperplane is chosen to
maximize the distance (or margin) between the closest data points from each class,
called support vectors. Linear SVM is most effective when data is linearly separable or
nearly so.
Applications of Linear SVM
1. Text Classification: Widely used for spam detection, sentiment analysis, and
categorizing documents, where data is often linearly separable.
2. Face Detection: Used to separate faces from non-face images, especially in low-
dimensional applications.
3. Image Classification: Helps classify images into different categories based on
visual features, effective in high-dimensional spaces.
4. Gene Classification: Applied in bioinformatics for classifying genes or proteins,
which often involves high-dimensional data.
Advantages of Linear SVM
1. Effective for Linearly Separable Data
2. High Accuracy and Robustness
3. Works Well with High-Dimensional Data
4. Memory Efficient
5. Clear Geometric Interpretation
Disadvantages of Linear SVM
1. Limited to Linearly Separable Data
2. Sensitive to Noise and Outliers
3. Requires Careful Parameter Tuning
4. No Probabilistic Interpretation
5. Computationally Intensive with Large Datasets
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
data = pd.read_csv(" ")  # dataset path as given in class
x = data.drop('Class',axis=1)
y = data['Class']
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.2,random_state=42)
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
# Train a linear-kernel SVM on the scaled features
linear_svm = SVC(kernel='linear')
linear_svm.fit(x_train, y_train)
y_pred = linear_svm.predict(x_test)
y_pred
accuracy = accuracy_score(y_test,y_pred)
print(classification_report(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
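Because the explanation above centres on support vectors and the separating hyperplane, a small follow-up sketch inspecting them on the fitted model (n_support_, coef_ and intercept_ are standard SVC attributes for a linear kernel):
print("Support vectors per class:", linear_svm.n_support_)
print("Hyperplane coefficients (w):", linear_svm.coef_)
print("Intercept (b):", linear_svm.intercept_)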
SVM NON LINEAR
A Non-Linear Support Vector Machine (SVM) is an extension of the linear SVM used
for classification and regression tasks where the data is not linearly separable. Instead
of a straight hyperplane, a non-linear SVM uses kernel functions (such as the radial
basis function (RBF), polynomial, or sigmoid kernels) to map the data into a higher-
dimensional space where it can be linearly separated. This transformation allows SVM
to handle complex and non-linearly separable data.
Applications of Non-Linear SVM
1. Image Classification: Used for classifying complex images, where the
relationship between pixel values is non-linear.
2. Text Classification: Effective for classifying text data where relationships
between words or features are non-linear.
3. Bioinformatics: Applied to gene expression classification or protein structure
prediction where data can have complex, non-linear relationships.
4. Speech Recognition: Used for classifying speech patterns, where the data is often
non-linearly separable.
Advantages of Non-Linear SVM
1. Effective for Complex Data
2. High Classification Accuracy
3. Flexible with Different Kernels
4. Handles High-Dimensional Data Well
5. Robust to Overfitting (with Proper Tuning)
Disadvantages of Non-Linear SVM
1. Computationally Intensive
2. Memory Intensive
3. Sensitive to Parameter Selection
4. Long Training Time
5. Difficult to Interpret
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import *
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
df = pd.read_csv(" ")  # dataset path as given in class
df
x = df.drop('Class',axis=1)
y = df['Class']
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
# Train an SVM with a non-linear (RBF) kernel on the scaled features
nonlinear_svm = SVC(kernel='rbf')
nonlinear_svm.fit(x_train, y_train)
y_pred = nonlinear_svm.predict(x_test)
print(accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
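Since the description above mentions several kernels, a small sketch comparing them on the same split (rbf, poly and sigmoid are standard SVC kernel options; default kernel parameters are used):
for kernel in ['rbf', 'poly', 'sigmoid']:
    clf = SVC(kernel=kernel)
    clf.fit(x_train, y_train)
    print(kernel, "accuracy:", accuracy_score(y_test, clf.predict(x_test)))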
NAIVE BAYES
Naive Bayes is a probabilistic classifier based on Bayes' Theorem, assuming
independence between features. It calculates the probability of each class given the
feature values and assigns the class with the highest probability. Despite its simplicity, it
performs well in many real-world tasks, particularly when the independence
assumption approximately holds.
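Concretely, under the independence assumption the posterior for a class C given features x₁, …, xₙ factorises as
P(C | x₁, …, xₙ) ∝ P(C) · P(x₁ | C) · P(x₂ | C) · … · P(xₙ | C)
and Naive Bayes predicts the class C with the largest value of this product.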
Applications of Naive Bayes
1. Text Classification: Widely used for spam filtering, sentiment analysis, and
document categorization.
2. Medical Diagnosis: Applied to classify diseases based on patient symptoms and
medical records.
3. Recommendation Systems: Used for recommending products or content by
classifying user preferences.
4. Email Filtering: Commonly used for filtering spam emails based on various
features.
Advantages of Naive Bayes
1. Simple and Fast
2. Works Well with Categorical Data
3. Handles Missing Data
4. Good Performance with High-Dimensional Data
5. Works Well with Imbalanced Data
Disadvantages of Naive Bayes
1. Assumption of Feature Independence
2. Poor Performance with Correlated Features
3. Limited Expressiveness
4. Requires Sufficient Data for Accurate Estimates
5. Not Suitable for Continuous Features without Discretization
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import GaussianNB
import warnings
warnings.filterwarnings("ignore")
df = pd.read_csv(" ")  # dataset path as given in class
df.head()
df.columns
df.tail()
df.shape
test_df = df.tail(4)
test_df = test_df.drop('Classification', axis=1)
test_df.head()
df = df.drop(df.index[-4:])
df.shape
le = LabelEncoder()
for i in df.columns:
    if df[i].dtype == 'object':
        df[i] = le.fit_transform(df[i])
for i in test_df.columns:
    if test_df[i].dtype == 'object':
        test_df[i] = le.fit_transform(test_df[i])
df.head()
test_df.head()
X = df.drop('Classification', axis=1)
y = df['Classification']
ngb = GaussianNB()
ngb.fit(X, y)
y_pred = ngb.predict(test_df)
y_pred
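Because the classifier assigns the class with the highest posterior probability, those probabilities can also be inspected. A short follow-up sketch on the fitted model above (predict_proba and classes_ are standard GaussianNB attributes):
# Posterior probability of each class for the four held-out rows
print(ngb.predict_proba(test_df))
print(ngb.classes_)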
HIERARCHICAL CLUSTERING
In hierarchical clustering, the linkage method determines how the distance between
clusters is calculated. There are several types of linkage methods used to group similar
data points into clusters:
1. Single Linkage: This method calculates the distance between two clusters based
on the shortest distance between any pair of points, one from each cluster. It is
also known as nearest point linkage.
2. Complete Linkage: This method calculates the distance between two clusters
based on the longest distance between any pair of points, one from each cluster.
It is also known as furthest point linkage.
3. Average Linkage: This method calculates the distance between two clusters by
averaging the distances between all possible pairs of points, one from each
cluster. It is also known as UPGMA (Unweighted Pair Group Method with
Arithmetic Mean).
Single Linkage
Advantages:
1. Handles Non-Spherical Clusters
2. Flexible
3. Simple to Compute
Disadvantages:
1. Sensitive to Noise and Outliers
2. Can Lead to "Chaining"
Complete Linkage
Advantages:
1. Compact Clusters
2. Less Sensitive to Outliers
3. Good for Well-Separated Data
Disadvantages:
1. Computationally Intensive
2. Not Effective for Non-Spherical Data
Average Linkage
Advantages:
1. Balanced Approach
2. Works Well with Elliptical Clusters
3. Less Sensitive to Outliers
Disadvantages:
1. Computational Complexity
2. Less Effective for Long Chains
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import silhouette_score, adjusted_rand_score
from scipy.cluster.hierarchy import fcluster
df = pd.read_csv(" ")  # dataset path as given in class
df.head()
# Compute the linkage matrix with single linkage on the numeric feature columns
mergings_single = linkage(df.select_dtypes(include=np.number), method='single')
plt.figure(figsize=(10, 7))
dendrogram(mergings_single, labels=df.index, leaf_rotation=90)
plt.title("Dendrogram (Single Linkage)")
plt.xlabel("Samples")
plt.ylabel("Distance")
plt.show()
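The code above only shows single linkage, while the imports also bring in fcluster and silhouette_score. A short follow-up sketch for the other two linkage methods described above and for cutting the tree into flat clusters (the choice of 3 clusters is illustrative):
X = df.select_dtypes(include=np.number)

for method in ['complete', 'average']:
    mergings = linkage(X, method=method)
    plt.figure(figsize=(10, 7))
    dendrogram(mergings, labels=df.index, leaf_rotation=90)
    plt.title(f"Dendrogram ({method.capitalize()} Linkage)")
    plt.xlabel("Samples")
    plt.ylabel("Distance")
    plt.show()

# Cut the single-linkage tree into 3 flat clusters and score them
cluster_labels = fcluster(mergings_single, t=3, criterion='maxclust')
print("Silhouette score:", silhouette_score(X, cluster_labels))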