ML W8 Merged
Date:
Aim:
Develop a program for Bias, Variance, Remove Duplicates, and Cross Validation.
Description:
1. Bias:
• Bias refers to the error introduced by approximating a real-world problem, which may be
too complex, with a simplified model. High bias leads to underfitting, where the model is
too simple to capture the underlying patterns in the data (a sketch estimating bias and
variance numerically follows this list).
2. Variance:
• Variance is the error caused by the model’s sensitivity to small fluctuations in the
training data. High variance results in overfitting, where the model captures noise
instead of the underlying pattern, making it perform poorly on new, unseen data.
3. Remove Duplicates:
• Removing duplicates involves identifying and eliminating duplicate records from the
dataset. This helps in reducing redundancy, improving the quality of data, and
ensuring that the model is not biased by repeated data points.
4. Cross-Validation:
• Cross-validation is a technique to evaluate the performance of a model by dividing the
dataset into multiple subsets, training the model on some subsets, and testing it on the
remaining ones. It helps in assessing how well the model generalizes to unseen data,
reducing the risk of overfitting.
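To make the bias and variance definitions above concrete, here is a minimal sketch (separate from the lab program below) of one common way to estimate them for a classifier: refit the model on bootstrap resamples of the training data, take the majority ("main") prediction for each test point, and measure the error of that main prediction (bias) and the average disagreement with it (variance). The dataset choice and variable names are illustrative assumptions.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
iris = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(iris.data, iris.target, test_size=0.4, random_state=0)
n_rounds = 50
rng = np.random.RandomState(0)
all_preds = []
for _ in range(n_rounds):
    idx = rng.choice(len(X_tr), size=len(X_tr), replace=True)  # bootstrap resample
    clf = RandomForestClassifier(random_state=0).fit(X_tr[idx], y_tr[idx])
    all_preds.append(clf.predict(X_te))
all_preds = np.array(all_preds)  # shape: (n_rounds, n_test_points)
# "Main" prediction = majority vote over the rounds for each test point
main_pred = np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, all_preds)
bias = np.mean(main_pred != y_te)           # error of the majority prediction
variance = np.mean(all_preds != main_pred)  # average disagreement with the majority
print("Estimated bias:", bias, "Estimated variance:", variance)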
Program:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score
import matplotlib.pyplot as plt
import seaborn as sns
# Load the Iris dataset into a DataFrame
from sklearn.datasets import load_iris
iris = load_iris()
df = pd.DataFrame(iris['data'], columns=iris['feature_names'])
df['target'] = iris['target']
df.info()
# Remove duplicates
df.drop_duplicates(inplace=True)
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df[iris['feature_names']], df['target'],
test_size=0.4, random_state=0)
# Fit a Random Forest Classifier model
model = RandomForestClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Simple proxies for bias and variance: mean squared error of the test predictions
# and the variance of the predictions
bias = np.mean((y_pred - y_test) ** 2)
variance = np.var(y_pred)
print("Bias: {:.2f}".format(bias))
print("Variance: {:.2f}".format(variance))
# Evaluate the model on the test set
print("Accuracy: {:.2f}".format(accuracy_score(y_test, y_pred)))
print("Precision: {:.2f}".format(precision_score(y_test, y_pred, average='macro')))
print("Recall: {:.2f}".format(recall_score(y_test, y_pred, average='macro')))
# Cross-validation
cv_scores = cross_val_score(model, df[iris['feature_names']], df['target'], cv=5)
print("\nCross-validation scores: ", cv_scores)
print("Mean CV Accuracy: {:.2f}".format(np.mean(cv_scores)))
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sepal length (cm) 150 non-null float64
1 sepal width (cm) 150 non-null float64
2 petal length (cm) 150 non-null float64
3 petal width (cm) 150 non-null float64
4 target 150 non-null int64
dtypes: float64(4), int64(1)
memory usage: 6.0 KB
Bias: 0.02
Variance: 0.97
Accuracy: 0.97
Precision: 0.97
Recall: 0.97
Viva Questions:
1. What is bias in machine learning?
Ans : Bias is the error introduced by approximating a real-world problem with a simplified
model.
2. What happens when a model has high bias?
Ans : It underfits the data, leading to poor performance on both training and test sets.
3. What is variance in machine learning?
Ans : Variance is the model’s sensitivity to small fluctuations in the training data.
4. How does variance affect model generalization?
Ans : High variance leads to poor generalization to unseen data.
5. How can you remove duplicates in a dataset?
Ans : By using functions like drop_duplicates() in pandas for Python.
6. What is k-fold cross-validation?
Ans : A method where the data is split into k subsets, and the model is trained k times, each
time using a different subset as the validation set.
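To illustrate answer 6, here is a minimal sketch of k-fold cross-validation done explicitly with KFold (equivalent in spirit to the cross_val_score call in the program above):
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score
iris = load_iris()
X, y = iris.data, iris.target
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, val_idx in kf.split(X):
    # Train on k-1 folds, validate on the held-out fold
    model = RandomForestClassifier().fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[val_idx], model.predict(X[val_idx])))
print("Fold accuracies:", scores)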
Aim:
Consider the Social Network Ads dataset. Apply a linear classification technique (SVM) to
predict the user response to social network ads.
Description:
A Support Vector Machine (SVM) is a powerful machine learning algorithm widely used for
both linear and nonlinear classification, as well as regression and outlier detection tasks.
SVMs are highly adaptable, making them suitable for various applications such as text
classification, image classification, spam detection, handwriting identification, gene
expression analysis, face detection, and anomaly detection.
The dimension of the hyperplane depends on the number of features. For instance, if there are
two input features, the hyperplane is simply a line, and if there are three input features, the
hyperplane becomes a 2-D plane. As the number of features increases beyond three, the
complexity of visualizing the hyperplane also increases.
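To connect the hyperplane idea to code: for a linear SVM in scikit-learn, the learned hyperplane w·x + b = 0 is exposed through the fitted model's coef_ (w) and intercept_ (b) attributes. A minimal sketch on a made-up two-feature dataset (not the ads dataset used in the program below):
from sklearn.datasets import make_blobs
from sklearn.svm import SVC
# Two features and two classes, so the separating hyperplane is a straight line
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=0)
clf = SVC(kernel='linear').fit(X, y)
w = clf.coef_[0]       # normal vector of the hyperplane
b = clf.intercept_[0]  # offset of the hyperplane
print("Hyperplane: {:.2f}*x1 + {:.2f}*x2 + {:.2f} = 0".format(w[0], w[1], b))
print("Number of support vectors:", len(clf.support_vectors_))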
Dataset:
Program:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("Social_Network_Ads.csv")
df.head()
df.shape
df.info()
X = df.iloc[:, [2, 3]]   # features: Age and EstimatedSalary
X
Y = df.iloc[:, 4]        # target: Purchased
Y
from sklearn.model_selection import train_test_split
X_Train, X_Test, Y_Train, Y_Test = train_test_split(X, Y, test_size=0.25, random_state=0)
X_Train.shape
X_Test.shape
X_Train.describe()
# Standardize the features before fitting the SVM
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_Train = sc_X.fit_transform(X_Train)
X_Test = sc_X.transform(X_Test)
# Fit a linear SVM classifier and evaluate it on the test set
from sklearn.svm import SVC
classifier = SVC(kernel='linear', random_state=0)
classifier.fit(X_Train, Y_Train)
Y_Pred = classifier.predict(X_Test)
from sklearn.metrics import confusion_matrix, accuracy_score
print(confusion_matrix(Y_Test, Y_Pred))
print("Accuracy:", accuracy_score(Y_Test, Y_Pred))
# Plot the test points coloured by the predicted class
plt.scatter(X_Test[:, 0], X_Test[:, 1], c=Y_Pred)
plt.xlabel('Age (scaled)')
plt.ylabel('Estimated Salary (scaled)')
plt.axis('tight')
plt.show()
Output:
Viva Questions:
1. What is the main objective of SVM?
Ans : To find the optimal hyperplane that maximizes the margin between different classes.
2. What is a hyperplane in SVM?
Ans : A decision boundary that separates classes in the feature space.
3. What is the margin in SVM?
Ans : The distance between the closest data points of each class and the hyperplane.
4. What is a support vector?
Ans : Data points closest to the hyperplane, which influence its position and orientation.
5. What is a kernel in SVM?
Ans : A function that transforms data into a higher-dimensional space to make it linearly
separable.
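As a quick illustration of the kernel idea in answer 5, the following toy sketch (not part of the lab program) fits a linear and an RBF kernel SVM on data that is not linearly separable; the RBF kernel scores much higher because it implicitly maps the points into a higher-dimensional space:
from sklearn.datasets import make_circles
from sklearn.svm import SVC
# Concentric circles: no straight line can separate the two classes in 2-D
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
for kernel in ['linear', 'rbf']:
    clf = SVC(kernel=kernel).fit(X, y)
    print(kernel, "kernel training accuracy:", clf.score(X, y))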
Aim: Write a program to implement k-Nearest Neighbor algorithm to classify the iris data
set. Print both correct and wrong predictions
Description:
The K-nearest neighbors (KNN) algorithm is a supervised machine learning algorithm that
classifies data points based on the class of their closest neighbors:
How it works:
The KNN algorithm classifies a new data point by looking at the labels of the closest
neighbors in the training dataset's feature space. It is based on the principle of similarity:
points that lie close together (by a distance measure such as Euclidean distance) are assumed
to belong to the same class, so the new point is given the majority label among its k nearest
neighbors (a small from-scratch sketch follows). KNN is useful when labeled data is expensive
or hard to get, and it can be used for a wide variety of prediction problems, including image
recognition, handwriting detection, and video recognition.
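The following is a minimal from-scratch sketch of this neighbor-voting principle (a toy illustration with made-up points, separate from the scikit-learn program below):
import numpy as np
from collections import Counter
def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]                     # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]   # majority label among them
# Tiny hypothetical 2-feature dataset with two classes
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9])))  # expected class 0
print(knn_predict(X_train, y_train, np.array([4.9, 5.1])))  # expected class 1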
Dataset:
{'data': array([[5.1, 3.5, 1.4, 0.2],
………………………………………………..
[6.5, 3. , 5.2, 2. ],
'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
…………………………………………………….,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
Program:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Print both correct and wrong predictions
for actual, predicted in zip(y_test, y_pred):
    print("Correct" if actual == predicted else "Wrong", "Actual:", actual, "Predicted:", predicted)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy", accuracy)
Output:
KNeighborsClassifier
KNeighborsClassifier(n_neighbors=3)
Accuracy 0.9666666666666667
Viva Questions:
1. What is K-Nearest Neighbors (KNN)?
Ans: KNN is a supervised learning algorithm used for both classification and regression. It
works by finding the 'k' closest data points (neighbors) to a query point and predicting the
class or value based on the majority label or average of these neighbors.
EXPERIMENT-5
Aim: To solve real-world problems using Logistic Regression
Description:
Logistic regression is a machine learning algorithm that uses a statistical method to
predict the probability of a binary outcome based on independent variables:
Purpose
Logistic regression is used to classify data into categories and understand the
relationship between variables.
How it works
Logistic regression uses a sigmoid function to map a linear combination of the input
variables to a probability between 0 and 1 (see the short sketch after the Applications note).
When to use it
Logistic regression is used when the outcome variable is binary or categorical, such as
yes or no.
Applications
Logistic regression is used in many fields, including medical research and insurance.
For example, researchers can use logistic regression to calculate the risk of cancer by
considering patient habits and genetic predispositions.
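As a small illustration of the sigmoid mapping described under "How it works" (a toy sketch with made-up coefficients, not part of the lab program):
import numpy as np
def sigmoid(z):
    # Maps any real number into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))
# Hypothetical linear combination of one input: z = b0 + b1 * x
b0, b1 = -4.0, 0.08   # made-up coefficients
for x in [20, 50, 80]:
    p = sigmoid(b0 + b1 * x)
    print("x =", x, "-> P(outcome = 1) =", round(p, 3), "-> predicted class", int(p >= 0.5))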
Dataset:
     Unnamed: 0  Age  Sex     ChestPain  RestBP  Chol  Fbs  RestECG  MaxHR  ExAng  Oldpeak  Slope   Ca        Thal  AHD
0             1   63    1       typical     145   233    1        2    150      0      2.3      3  0.0       fixed   No
1             2   67    1  asymptomatic     160   286    0        2    108      1      1.5      2  3.0      normal  Yes
2             3   67    1  asymptomatic     120   229    0        2    129      1      2.6      2  2.0  reversable  Yes
3             4   37    1    nonanginal     130   250    0        0    187      0      3.5      3  0.0      normal   No
4             5   41    0    nontypical     130   204    0        2    172      0      1.4      1  0.0      normal   No
...         ...  ...  ...           ...     ...   ...  ...      ...    ...    ...      ...    ...  ...         ...  ...
298         299   45    1       typical     110   264    0        0    132      0      1.2      2  0.0  reversable  Yes
299         300   68    1  asymptomatic     144   193    1        0    141      0      3.4      2  2.0  reversable  Yes
300         301   57    1  asymptomatic     130   131    0        0    115      1      1.2      2  1.0  reversable  Yes
301         302   57    0    nontypical     130   236    0        2    174      0      0.0      2  1.0      normal  Yes
302         303   38    1    nonanginal     138   175    0        0    173      0      0.0      1  NaN      normal   No
303 rows × 15 columns
Program:
import pandas as pd
df = pd.read_csv("Heart.csv")
df.info()
df = df.drop(columns = "Unnamed: 0")
df
df['ChestPain'] = df['ChestPain'].astype('category')
df['ChestPain'] = df['ChestPain'].cat.codes
df
df['Thal'] = df['Thal'].astype('category')
df['Thal'] = df['Thal'].cat.codes
df['AHD'] = df['AHD'].astype('category')
df['AHD'] = df['AHD'].cat.codes
df
df.isnull().sum()
df = df.dropna()
df
X = df.drop(columns = 'AHD')
X
y = df['AHD']
y
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X, y, test_size =0.3, random_state =
23)
X_train
X_test
# Scale the features before fitting the model
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Fit a Logistic Regression model, predict on the test set, and report train/test accuracy
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train_scaled, y_train)
model.predict(X_test_scaled)
print(model.score(X_train_scaled, y_train))
print(model.score(X_test_scaled, y_test))
# Lasso with strong regularisation, for comparison (coefficients shrink to zero, score near 0)
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=50, max_iter=100, tol=0.1)
lasso.fit(X_train_scaled, y_train)
print(lasso.score(X_train_scaled, y_train))
print(lasso.score(X_test_scaled, y_test))
Output:
array([1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0,
…………………………………………………………
0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0,
0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1], dtype=int8)
0.8755980861244019
0.8111111111111111
Lasso
Lasso(alpha=50, max_iter=100, tol=0.1)
-0.0002953787464659019
-0.0002953787464659019
Viva Questions:
1. What is Logistic Regression?
Ans: Logistic regression is a machine learning algorithm that uses a statistical
method to predict the probability of a binary outcome based on independent variables.
2. How does Logistic Regression work?
Ans: Logistic regression uses a sigmoid function to model the relationship between
variables and output a probability between 0 and 1.
3. Why can't we use Linear Regression for classification problems?
Ans: Because its predictions are not restricted to the range [0, 1] and can produce
values that are greater than 1 or less than 0, which don't make sense for probabilities.
4. What is the cost function used in Logistic Regression?
Ans: In Logistic Regression, we use the Log Loss (or Binary Cross-Entropy) as the cost
function, which is defined as:
J(θ) = −(1/m) ∑_{i=1}^{m} [ y_i · log(h_θ(x_i)) + (1 − y_i) · log(1 − h_θ(x_i)) ]
where y_i is the actual label, h_θ(x_i) is the predicted probability for example x_i, and m is
the number of training examples.
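A small numeric illustration of this cost (the predicted probabilities are made up for the example):
import numpy as np
y_true = np.array([1, 0, 1, 1])           # actual labels
p_pred = np.array([0.9, 0.2, 0.6, 0.95])  # hypothetical predicted probabilities
# Binary cross-entropy / log loss
loss = -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
print(round(loss, 4))  # ≈ 0.2227: confident, correct predictions give a small loss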
5. What is Multinomial Logistic Regression?
Ans: Multinomial Logistic Regression is an extension of binary Logistic Regression
that is used for multi-class classification problems (where the target variable has more
than two categories). It models the probability of each class as a function of the input
variables using a softmax function instead of the sigmoid function.
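A tiny sketch of the softmax function that multinomial logistic regression uses in place of the sigmoid (the class scores are made up):
import numpy as np
def softmax(z):
    # Converts a vector of class scores into probabilities that sum to 1
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()
scores = np.array([2.0, 1.0, 0.1])  # hypothetical scores for three classes
probs = softmax(scores)
print(probs, probs.sum())            # roughly [0.66 0.24 0.10], summing to 1.0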
DECISION TREE
Aim: Write a program to demonstrate the working of the decision tree based ID3 algorithm.
Use an appropriate data set for building the decision tree and apply this knowledge to classify a
new sample.
Description: The ID3 (Iterative Dichotomiser 3) algorithm is a supervised learning
algorithm used to create decision trees for classification tasks. It works by recursively
partitioning the dataset based on the feature that provides the highest information gain, which is
a measure of how well a feature separates the data into distinct classes.
• Entropy is a measure of disorder or uncertainty in a dataset. It helps determine the
impurity in a set of examples.
• Information Gain (IG) is a measure of the effectiveness of a feature in classifying the
training data. It quantifies the reduction in entropy after the dataset is split based on a
specific feature.
Dataset:
Advantages of ID3:
• Simple and easy to understand.
• Requires little training data.
• Works well with categorical (discrete) attributes; extensions such as C4.5 handle continuous attributes.
Disadvantages of ID3:
• Can lead to overfitting.
• May not be effective with data with many attributes.
Applications of ID3:
1. Fraud detection
2. Medical diagnosis
3. Customer segmentation
4. Risk assessment
5. Recommendation systems
Formulas:
1. Entropy:
A measure of disorder or uncertainty in a set of data is called Entropy.
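For a set S with classes i = 1, …, c, where p_i is the proportion of examples in S that belong to class i:
Entropy(S) = − ∑_{i=1}^{c} p_i · log2(p_i)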
2. Information Gain:
A measure of how well a given attribute reduces uncertainty is called Information Gain. At each
stage, ID3 splits the data on the attribute that maximizes Information Gain.
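For an attribute A that splits S into subsets S_v, one for each value v of A:
Gain(S, A) = Entropy(S) − ∑_v (|S_v| / |S|) · Entropy(S_v)
A minimal sketch of both formulas in code, using a small hypothetical play/don't-play style target column and a made-up split (this is not the lab dataset):
import numpy as np
from collections import Counter
def entropy(labels):
    # Entropy(S) = -sum(p_i * log2(p_i)) over the classes present in S
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()
def information_gain(labels, groups):
    # Gain(S, A) = Entropy(S) - sum(|S_v|/|S| * Entropy(S_v))
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - remainder
labels = ['Yes', 'Yes', 'No', 'No', 'Yes', 'No', 'Yes', 'Yes']                     # target column
split_by_attribute = [['Yes', 'Yes', 'Yes'], ['No', 'No'], ['No', 'Yes', 'Yes']]   # one subset per attribute value
print("Entropy(S) =", round(entropy(labels), 3))
print("Gain(S, A) =", round(information_gain(labels, split_by_attribute), 3))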