ML LAB Manual-1

This manual outlines various statistical and machine learning techniques: measures of central tendency and dispersion, data preprocessing (attribute selection, handling missing values, discretization, and elimination of outliers), KNN classification and regression, decision tree algorithms for classification and regression, random forests, Naïve Bayes classification, and support vector machine classification. It provides Python code for each technique, demonstrating its implementation with libraries such as pandas, scikit-learn, and matplotlib, and shows sample outputs including accuracy scores, RMSE values, and decision tree visualizations.

PROGRAM:

# Compute Central Tendency Measures (Mean, Median, Mode) and Measures of Dispersion (Variance, Standard Deviation)
# Python code to demonstrate mean(), median(), mode(), variance() and stdev()
# importing statistics to handle statistical operations
import statistics

# initializing the data list
data = [1, 2, 3, 3, 2, 2, 2, 1]

# Mean
mean = statistics.mean(data)
print("Mean:", mean)

# Median
median = statistics.median(data)
print("Median:", median)

# Mode
mode = statistics.mode(data)
print("Mode:", mode)

# Variance
variance = statistics.variance(data)
print("Variance:", variance)

# Standard Deviation
std_deviation = statistics.stdev(data)
print("Standard Deviation:", std_deviation)

OUTPUT:
Mean: 2
Median: 2.0
Mode: 2
Variance: 0.5714285714285714
Standard Deviation: 0.7559289460184545
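The same measures can also be computed with pandas when the data is held in a Series or DataFrame; a minimal sketch on the same values (this alternative is not part of the original program):

import pandas as pd

# Same values as the statistics example above
s = pd.Series([1, 2, 3, 3, 2, 2, 2, 1])

print("Mean:", s.mean())
print("Median:", s.median())
print("Mode:", s.mode().iloc[0])       # mode() returns a Series; take the first value
print("Variance:", s.var())            # sample variance (ddof=1), same as statistics.variance
print("Standard Deviation:", s.std())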
PROGRAM:
#Apply the following Pre-processing techniques for a given dataset.
#Attribute selection
#Handling Missing Values
#Discretization
#Elimination of Outliers
A. Attribute selection
METHOD 1: Display Attributes or Features from a DataFrame

import pandas as pd
# Creating a sample dataframe of job positions
df = pd.DataFrame({
    'Job Position': ['CEO', 'Senior Manager', 'Junior Manager', 'Employee', 'Assistant Staff'],
    'Years of Experience': [5, 4, 3, 2, 1],
    'Salary': [100000, 80000, 90000, 40000, 20000]
})
print(df.columns)

METHOD2: Feature Selection from Dataset

# Load the Iris dataset from scikit-learn
import pandas as pd
from sklearn.datasets import load_iris

# Create input and output features
feature_names = load_iris().feature_names
X_data = pd.DataFrame(load_iris().data, columns=feature_names)
y_data = load_iris().target

# Show the first five rows of the dataset
print(X_data.head())

METHOD 3:
import pandas as pd
# Load the dataset (replace 'your_data.csv' with the actual path)
data = pd.read_csv('your_data.csv')

# Display the feature names
print("Feature Names:")
print(data.columns)
OUTPUT:
Index(['Job Position', 'Years of Experience', 'Salary'], dtype='object')

sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
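The three methods above only list the attribute names. Actual attribute (feature) selection ranks the features and keeps the most informative ones; a minimal sketch using scikit-learn's SelectKBest on the Iris data (the ANOVA F-test score function and k=2 are illustrative choices, not part of the manual's programs):

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

# Keep the two features with the highest ANOVA F-scores
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print("Selected features:", X.columns[selector.get_support()].tolist())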
B. Handling Missing Values
import pandas as pd

# Creating the DataFrame with some missing values


df = pd.DataFrame({
'Job Position': ['CEO', 'Senior Manager', 'Junior Manager', 'Employee',
'Assistant Staff'],
'Years of Experience': [5, 4, 3, None, 1],
'Salary': [100000, 80000, None, 40000, 20000]
})

# Viewing the original dataframe


print("Original DataFrame with Missing Values:")
print(df)

Method1: Data Removal


Remove the missing data rows (data points) from the dataset.
# Dropping the rows with missing values (rows 2 and 3)
dropped_df = df.drop([2, 3], axis=0)

# Viewing the cleaned dataframe


print("\nMethod 1 - After Dropping Rows with Missing Values:")
print(dropped_df)
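Dropping rows 2 and 3 by index works here because the incomplete rows are known in advance. A more general sketch, continuing with the same df, lets pandas find and drop them (dropna is an alternative, not part of the original method):

# Drop every row that contains at least one missing value
dropped_df = df.dropna()
print(dropped_df)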
Method 2: Fill missing values

# Filling missing values with the mean of their respective columns


df['Years of Experience'] = df['Years of Experience'].fillna(df['Years of Experience'].mean())
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())

# Viewing the updated dataframe


print("\nMethod 2 - After Filling Missing Values with Mean:")
print(df)

OUTPUT:

Original DataFrame with Missing Values:


Job Position Years of Experience Salary
0 CEO 5.0 100000.0
1 Senior Manager 4.0 80000.0
2 Junior Manager 3.0 NaN
3 Employee NaN 40000.0
4 Assistant Staff 1.0 20000.0

Method 1 - After Dropping Rows with Missing Values:


Job Position Years of Experience Salary
0 CEO 5.0 100000.0
1 Senior Manager 4.0 80000.0

4 Assistant Staff 1.0 20000.0


Method 2 - After Filling Missing Values with Mean:
Job Position Years of Experience Salary
0 CEO 5.00 100000.0
1 Senior Manager 4.00 80000.0
2 Junior Manager 3.00 60000.0
3 Employee 3.25 40000.0
4 Assistant Staff 1.00 20000.0
C. Discretization:
Program:
import pandas as pd

# Sample data
data = {'age': [25, 30, 28, 45, 50, 35, 22, 60]}
df = pd.DataFrame(data)

# Discretize 'age' into 3 categories using quantile-based binning


df['age_category'] = pd.qcut(df['age'], q=[0, 0.33, 0.66, 1],
labels=['Young', 'Middle Aged', 'Old'])

# Viewing the final DataFrame


print(df)

OUTPUT:
age age_category
0 25 Young
1 30 Middle Aged
2 28 Young
3 45 Old
4 50 Old
5 35 Middle Aged
6 22 Young
7 60 Old
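qcut builds quantile-based bins, so each category holds roughly the same number of rows. If equal-width bins over the age range are wanted instead, pd.cut can be used; a minimal sketch on the same ages (an alternative, not part of the original program):

import pandas as pd

df = pd.DataFrame({'age': [25, 30, 28, 45, 50, 35, 22, 60]})

# Three equal-width bins spanning the age range
df['age_category'] = pd.cut(df['age'], bins=3, labels=['Young', 'Middle Aged', 'Old'])
print(df)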
D. Elimination of Outliers

METHOD1: Boxplots

import matplotlib.pyplot as plt

# Sample data
sample = [15, 105, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9]

# Creating the boxplot


plt.boxplot(sample, vert=False)

# Adding title and labels


plt.title("Detecting Outliers Using Boxplot")
plt.xlabel("Sample Values")

# Displaying the plot


plt.show()

METHOD2: Z-SCORE

import numpy as np

def detect_outliers_zscore(data):
    outliers = []  # define inside the function to avoid accumulating results from previous calls
    threshold = 3
    mean = np.mean(data)
    std = np.std(data)

    print("Mean:", mean)
    print("Standard Deviation:", std)

    for i in data:
        z_score = (i - mean) / std
        if np.abs(z_score) > threshold:
            outliers.append(i)

    return outliers

# Sample data (same as in the boxplot example)
sample = [15, 105, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9]

# Detect outliers using the Z-score method


sample_outliers = detect_outliers_zscore(sample)

print("Outliers from Z-score method:", sample_outliers)


METHOD3: IQR

import numpy as np

def detect_outliers_iqr(data):
    outliers = []  # declare inside the function to avoid reusing results from previous calls
    data = sorted(data)

    # Calculate Q1 (25th percentile) and Q3 (75th percentile)
    q1 = np.percentile(data, 25)
    q3 = np.percentile(data, 75)

    # Compute the interquartile range
    IQR = q3 - q1

    # Define the bounds
    lower_bound = q1 - (1.5 * IQR)
    upper_bound = q3 + (1.5 * IQR)

    # Debug print (optional)
    print(f"Q1: {q1}, Q3: {q3}, IQR: {IQR}")
    print(f"Lower Bound: {lower_bound}, Upper Bound: {upper_bound}")

    # Find outliers
    for i in data:
        if i < lower_bound or i > upper_bound:
            outliers.append(i)

    return outliers

# Sample data (same as before)


sample = [15, 105, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9]

# Detect outliers using the IQR method


sample_outliers = detect_outliers_iqr(sample)

print("Outliers from IQR method:", sample_outliers)

OUTPUT:

Mean: 20.416666666666668
Standard Deviation: 25.882614284925356
Outliers from Z-score method: [105]
Q1: 9.75, Q3: 16.5, IQR: 6.75
Lower Bound: -0.375, Upper Bound: 26.625
Outliers from IQR method: [105]
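The Z-score and IQR methods above only detect the outliers; the elimination step itself is a simple filter. A minimal sketch continuing from the IQR result above:

# Remove the detected outliers from the sample
cleaned_sample = [x for x in sample if x not in sample_outliers]
print("Sample after outlier removal:", cleaned_sample)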
PROGRAM:
#Apply KNN algorithm for classification and regression
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, mean_squared_error
import matplotlib.pyplot as plt
import numpy as np

# Load the Iris dataset


iris = load_iris()

# Data for Classification


X_class = iris.data # All features
y_class = iris.target # Class labels

# Data for Regression (Petal Length -> Sepal Length)


X_reg = iris.data[:, 2].reshape(-1, 1) # Petal length
y_reg = iris.data[:, 0] # Sepal length

# Split data into training and testing sets


X_train_class, X_test_class, y_train_class, y_test_class = train_test_split(
    X_class, y_class, test_size=0.2, random_state=42
)
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)
# Find optimal k for classification
k_range = range(1, 21)
accuracies = []
for k in k_range:
    knn_clf = KNeighborsClassifier(n_neighbors=k)
    knn_clf.fit(X_train_class, y_train_class)
    y_pred_class = knn_clf.predict(X_test_class)
    accuracies.append(accuracy_score(y_test_class, y_pred_class))

optimal_k_class = k_range[np.argmax(accuracies)]
print(f"Optimal k for classification: {optimal_k_class}")

# Find optimal k for regression


rmse_values = []
for k in k_range:
    knn_reg = KNeighborsRegressor(n_neighbors=k)
    knn_reg.fit(X_train_reg, y_train_reg)
    y_pred_reg = knn_reg.predict(X_test_reg)
    rmse_values.append(np.sqrt(mean_squared_error(y_test_reg, y_pred_reg)))

optimal_k_reg = k_range[np.argmin(rmse_values)]
print(f"Optimal k for regression: {optimal_k_reg}")
# Plot the results
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(k_range, accuracies, marker='o')
plt.xlabel('Number of Neighbors (k)')
plt.ylabel('Accuracy')
plt.title('KNN Classification Accuracy')

plt.subplot(1, 2, 2)
plt.plot(k_range, rmse_values, marker='o')
plt.xlabel('Number of Neighbors (k)')
plt.ylabel('RMSE')
plt.title('KNN Regression RMSE')

plt.tight_layout()
plt.show()
OUTPUT:
(The optimal k values for classification and regression are printed, and the accuracy and RMSE plots are displayed.)
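Because KNN is distance-based, unscaled features can dominate the neighbour search. A minimal sketch that standardises the features before fitting, continuing with the split classification data above (StandardScaler and k=5 are illustrative assumptions, not part of the program):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scale the features, then fit KNN with an illustrative k of 5
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(X_train_class, y_train_class)
print("Scaled KNN accuracy:", accuracy_score(y_test_class, scaled_knn.predict(X_test_class)))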
PROGRAM:
#Demonstrate decision tree algorithm for a classification problem and perform parameter tuning
#for better results
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Load the Iris dataset


iris = load_iris()
X = iris.data
y = iris.target

# Split the data


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=42)

# Create a Decision Tree Classifier


clf = DecisionTreeClassifier(random_state=42)

# Train the model


clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Evaluate initial accuracy


accuracy = accuracy_score(y_test, y_pred)
print("Initial Accuracy:", accuracy)

# Define parameter grid for tuning


param_grid = {
'criterion': ['gini', 'entropy'],
'max_depth': [None, 5, 10, 15],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4]
}

# Perform Grid Search with Cross-Validation


grid_search = GridSearchCV(estimator=clf, param_grid=param_grid,
cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Print best parameters and score


print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)

# Train with best parameters


best_clf = DecisionTreeClassifier(**grid_search.best_params_)
best_clf.fit(X_train, y_train)

# Evaluate with best parameters


best_y_pred = best_clf.predict(X_test)
best_accuracy = accuracy_score(y_test, best_y_pred)
print("Accuracy with Best Parameters:", best_accuracy)

# Visualize the decision tree


plt.figure(figsize=(15, 10))
plot_tree(best_clf, feature_names=iris.feature_names,
class_names=iris.target_names, filled=True)
plt.show()
OUTPUT:
Initial Accuracy: 1.0
Best Parameters: {'criterion': 'gini', 'max_depth': None,
'min_samples_leaf': 1, 'min_samples_split': 10}
Best Score: 0.9428571428571428
Accuracy with Best Parameters: 1.0
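GridSearchCV refits the best model on the whole training set by default (refit=True), so the tuned classifier can also be read directly from the search object instead of being rebuilt from best_params_; a minimal sketch continuing from the fitted grid_search above:

# Equivalent to constructing a new tree from best_params_ and refitting it
best_clf = grid_search.best_estimator_
best_accuracy = accuracy_score(y_test, best_clf.predict(X_test))
print("Accuracy with best estimator:", best_accuracy)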
PROGRAM:
# Demonstrate decision tree algorithm for a regression problem

from sklearn.tree import DecisionTreeRegressor, plot_tree


from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
import numpy as np

# Load the Iris dataset


iris = load_iris()

# Select features and target


X = iris.data[:, 2].reshape(-1, 1) # Petal length as the feature
y = iris.data[:, 0] # Sepal length as the target

# Split data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)

# Create a Decision Tree Regressor


regressor = DecisionTreeRegressor(random_state=42)

# Train the model


regressor.fit(X_train, y_train)

# Make predictions
y_pred = regressor.predict(X_test)

# Evaluate the model


mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print("Root Mean Squared Error:", rmse)

# Visualize the decision tree


plt.figure(figsize=(10, 6))
plot_tree(regressor, feature_names=["Petal Length"], filled=True)
plt.show()
OUTPUT:
(The root mean squared error is printed and the fitted regression tree is displayed.)
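An unconstrained regression tree can grow until it nearly memorises the training data. Limiting the depth is a common follow-up experiment; a minimal sketch continuing from the split data above (max_depth=3 is an illustrative choice, not part of the program):

# A shallower, regularised version of the regressor above
shallow_regressor = DecisionTreeRegressor(max_depth=3, random_state=42)
shallow_regressor.fit(X_train, y_train)
shallow_rmse = np.sqrt(mean_squared_error(y_test, shallow_regressor.predict(X_test)))
print("Root Mean Squared Error (max_depth=3):", shallow_rmse)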
PROGRAM:
# Apply Random Forest algorithm for classification and regression
CLASSIFICATION:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset


data = load_iris()
X = data.data
y = data.target

# Split data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)

# Initialize and train Random Forest classifier


rf_classifier = RandomForestClassifier(n_estimators=100,
random_state=42)
rf_classifier.fit(X_train, y_train)

# Make predictions
y_pred = rf_classifier.predict(X_test)

# Evaluate model performance


accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

OUTPUT:

Accuracy: 100.00%
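A fitted random forest also exposes per-feature importance scores, which are often inspected right after the accuracy; a minimal sketch continuing from the fitted rf_classifier above:

# Mean-decrease-in-impurity importance for each Iris feature
for name, importance in zip(data.feature_names, rf_classifier.feature_importances_):
    print(f"{name}: {importance:.3f}")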
REGRESSION:
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Generate a random regression dataset


X, y = make_regression(n_samples=100, n_features=5, noise=0.1,
random_state=42)

# Split data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)

# Initialize and train Random Forest regressor


rf_regressor = RandomForestRegressor(n_estimators=100,
random_state=42)
rf_regressor.fit(X_train, y_train)

# Make predictions
y_pred = rf_regressor.predict(X_test)

# Evaluate model performance using Mean Squared Error


mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")

OUTPUT:

Mean Squared Error: 5253.45
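Mean squared error depends on the scale of the targets, which make_regression makes fairly large, so an R² score is often reported alongside it; a minimal sketch continuing from the predictions above:

from sklearn.metrics import r2_score

# Coefficient of determination on the test set
print(f"R^2 score: {r2_score(y_test, y_pred):.3f}")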


PROGRAM:
# Demonstrate Naïve Bayes Classification algorithm.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.datasets import load_iris

# Load the Iris dataset


data = load_iris()
X = data.data # Features
y = data.target # Labels

# Convert to DataFrame for better visualization


df = pd.DataFrame(X, columns=data.feature_names)
df['Target'] = y
print(df.head())

# Split data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)

# Initialize and train the model


nb_classifier = GaussianNB()
nb_classifier.fit(X_train, y_train)

# Predict on test data


y_pred = nb_classifier.predict(X_test)

# Compute accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:\n", conf_matrix)

# Classification Report
print("\nClassification Report:\n", classification_report(
y_test, y_pred, target_names=data.target_names
))

# Visualize the Confusion Matrix


plt.figure(figsize=(6, 4))
sns.heatmap(
conf_matrix,
annot=True,
cmap="Blues",
fmt="d",
xticklabels=data.target_names,
yticklabels=data.target_names
)
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix - Naïve Bayes")
plt.tight_layout()
plt.show()
OUTPUT:
(The first rows of the DataFrame, the accuracy, the confusion matrix, and the classification report are printed, and the confusion-matrix heatmap is displayed.)
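GaussianNB also provides class posterior probabilities via predict_proba, which makes the probabilistic nature of Naïve Bayes visible; a minimal sketch continuing from the fitted nb_classifier above:

# Posterior probability of each class for the first test sample
probs = nb_classifier.predict_proba(X_test[:1])
for name, p in zip(data.target_names, probs[0]):
    print(f"P({name}) = {p:.4f}")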
PROGRAM:
# Apply Support Vector algorithm for classification
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Iris dataset (a classic example for classification)


iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)

# Feature Scaling (important for SVMs)


scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create an SVM classifier (using RBF kernel)


svm_classifier = SVC(kernel='rbf', C=1, gamma='scale',
random_state=42)
svm_classifier.fit(X_train_scaled, y_train)

# Make predictions on the test set


y_pred = svm_classifier.predict(X_test_scaled)

# Evaluate the accuracy of the classifier


accuracy = accuracy_score(y_test, y_pred)
print(f"RBF Kernel Accuracy: {accuracy:.2f}")

# Create and evaluate a linear kernel SVM


linear_svm = SVC(kernel='linear', C=1, random_state=42)
linear_svm.fit(X_train_scaled, y_train)
linear_pred = linear_svm.predict(X_test_scaled)
linear_accuracy = accuracy_score(y_test, linear_pred)
print(f"Linear Kernel Accuracy: {linear_accuracy:.2f}")

OUTPUT:

RBF Kernel Accuracy: 1.00


Linear Kernel Accuracy: 0.98
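As with the decision tree program, the SVM hyperparameters C and gamma can be tuned with a grid search; a minimal sketch continuing from the scaled training data above (the grid values are illustrative assumptions):

from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.1, 1, 10], 'gamma': ['scale', 0.1, 1]}
grid = GridSearchCV(SVC(kernel='rbf', random_state=42), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train_scaled, y_train)
print("Best SVM parameters:", grid.best_params_)
print("Best cross-validation accuracy:", grid.best_score_)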
