Simple linear regression
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

def generate_dataset(n_samples=100):
    np.random.seed(42)
    X = 2 * np.random.rand(n_samples, 1)
    y = 3 * X + 4 + np.random.randn(n_samples, 1)
    return X, y

class SimpleLinearRegression:
    def __init__(self):
        self.slope = None
        self.intercept = None

    def fit(self, X, y):
        # Closed-form least-squares estimates for the slope and intercept
        X_mean = np.mean(X)
        y_mean = np.mean(y)
        numerator = np.sum((X - X_mean) * (y - y_mean))
        denominator = np.sum((X - X_mean) ** 2)
        self.slope = numerator / denominator
        self.intercept = y_mean - self.slope * X_mean

    def predict(self, X):
        return self.slope * X + self.intercept

if __name__ == "__main__":
    X, y = generate_dataset()
    dataset = pd.DataFrame({
        "X": X.flatten(),
        "y": y.flatten()
    })
    print("Dataset:")
    print(dataset)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    model = SimpleLinearRegression()
    model.fit(X_train.flatten(), y_train.flatten())

    y_pred = model.predict(X_test.flatten())
    mse = mean_squared_error(y_test, y_pred)

    print(f"Model Coefficients: Slope = {model.slope:.2f}, Intercept = {model.intercept:.2f}")
    print(f"Mean Squared Error on Test Set: {mse:.2f}")

    plt.scatter(X, y, color="blue", label="Actual Data")
    plt.plot(X, model.predict(X.flatten()), color="red", label="Regression Line")
    plt.xlabel('X')
    plt.ylabel('y')
    plt.title('Simple Linear Regression')
    plt.legend()
    plt.show()
Explanation:
Step-by-step breakdown:
• Step 1: Importing Libraries
o numpy: Used for generating synthetic data and performing numerical operations.
o pandas: Used to display the generated dataset as a DataFrame.
o matplotlib.pyplot: Used for visualizing the data and the regression line.
o train_test_split: Splits the dataset into training and testing sets.
o mean_squared_error: Used to evaluate the performance of the model by computing the mean squared error.
• Step 2: Generating Synthetic Data
o We generate synthetic data using the equation y = 3x + 4 with some added Gaussian noise. This helps simulate real-world data where the relationship between variables is linear but contains some randomness.
o X contains the feature values (input), and y contains the target values (output).
• Step 3: Splitting Data
o train_test_split() divides the data into training and testing sets. 80% of the data is
used for training, and 20% is used for testing.
• Step 4: Initializing the Model
o We create an instance of the SimpleLinearRegression class, which implements a least-squares linear fit from scratch.
• Step 5: Training the Model
o model.fit(X_train.flatten(), y_train.flatten()) fits the model to the training data, learning the coefficients (slope and intercept) that best describe the linear relationship between X and y.
• Step 6: Making Predictions
o y_pred = model.predict(X_test.flatten()) predicts the target values (y_pred) for the test data (X_test).
• Step 7: Evaluating the Model
o Mean Squared Error (MSE) is used to measure how well the model fits the data. A lower MSE indicates a better fit.
o R-squared measures the proportion of the variance in the target variable that is predictable from the features. A value closer to 1 indicates a good fit (a short sketch computing both metrics follows this list).
• Step 8: Visualizing the Results
o We use matplotlib to plot the data points (X vs. y) together with the regression line that represents the model's predictions.
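A minimal sketch of the evaluation in Step 7, assuming the y_test and y_pred variables from the script above are in scope; r2_score is scikit-learn's R-squared helper:

from sklearn.metrics import mean_squared_error, r2_score

# Assumes y_test and y_pred come from the script above
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)  # 1.0 would mean the line explains all of the variance
print(f"MSE = {mse:.2f}, R^2 = {r2:.2f}")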
How Linear Regression Works:
Linear regression attempts to model the relationship between a dependent variable y and an
independent variable X by fitting a straight line to the data. The relationship is described by the
equation:
y = β₀ + β₁·X
Where:
• y is the target variable (output),
• X is the input feature (independent variable),
• β₀ is the intercept (where the line crosses the y-axis),
• β₁ is the slope of the line.
The goal of the algorithm is to find the values of β₀ and β₁ that minimize the difference between the predicted values y_pred and the actual values y (using a loss function like Mean Squared Error).
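The program above computes β₀ and β₁ with the closed-form least-squares formulas. As an illustrative alternative, the same Mean Squared Error can be minimized iteratively with gradient descent; the learning rate, iteration count, and variable names below are assumptions made for this sketch, not part of the original program:

import numpy as np

def fit_gradient_descent(X, y, lr=0.05, n_iters=2000):
    # Iteratively minimize MSE for y ≈ b0 + b1 * X (illustrative hyperparameters)
    b0, b1 = 0.0, 0.0
    n = len(X)
    for _ in range(n_iters):
        y_hat = b0 + b1 * X
        error = y_hat - y
        b0 -= lr * (2 / n) * np.sum(error)        # gradient of MSE w.r.t. b0
        b1 -= lr * (2 / n) * np.sum(error * X)    # gradient of MSE w.r.t. b1
    return b0, b1

# Data generated the same way as generate_dataset() above
X_demo = 2 * np.random.rand(100)
y_demo = 3 * X_demo + 4 + np.random.randn(100)
print(fit_gradient_descent(X_demo, y_demo))  # should approach (intercept ≈ 4, slope ≈ 3)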
KNN program
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

class KNNClassifier:
    def __init__(self, k=3):
        self.k = k
        self.X_train = None
        self.y_train = None

    def fit(self, X, y):
        # KNN is a lazy learner: "fitting" simply stores the training data
        self.X_train = X
        self.y_train = y

    def predict(self, X):
        predictions = []
        for x in X:
            # Euclidean distance from x to every training point
            distances = np.linalg.norm(self.X_train - x, axis=1)
            # Indices of the k nearest neighbors
            nearest_indices = distances.argsort()[:self.k]
            nearest_labels = self.y_train[nearest_indices]
            # Majority vote among the k nearest labels (labels must be non-negative ints)
            prediction = np.bincount(nearest_labels).argmax()
            predictions.append(prediction)
        return np.array(predictions)

iris = load_iris()
iris_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                       columns=iris['feature_names'] + ['target'])

X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

knn = KNNClassifier(k=3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

print("\nPredictions:")
for i, (true_label, pred_label) in enumerate(zip(y_test, y_pred)):
    status = "Correct" if true_label == pred_label else "Incorrect"
    print(f"Test Sample {i + 1}: True Label = {true_label}, Predicted = {pred_label}, {status}")
Explanation:
• Step 1: Importing Libraries
o numpy: Used for handling arrays and matrix operations.
o pandas: Used to view the Iris data as a DataFrame.
o train_test_split: This function splits the dataset into training and testing subsets.
o load_iris: A function to load the Iris dataset, a classic dataset used for classification tasks.
o accuracy_score: This function calculates the accuracy of predictions by comparing the predicted labels to the true labels.
• Step 2: Loading the Dataset
o We use load_iris() to load the Iris dataset, which is a simple classification
dataset where the goal is to predict the type of iris flower (Setosa, Versicolour, or
Virginica) based on four features (sepal length, sepal width, petal length, petal
width).
• Step 3: Splitting the Data
o train_test_split() splits the data into a training set and a test set (80% for training
and 20% for testing in this case). It shuffles the data and ensures that we
evaluate the model on unseen data.
• Step 4: Creating the KNN Classifier
o We create an instance of our KNNClassifier class with k=3. This means that the class of a new data point will be predicted from the majority class among its 3 nearest neighbors.
• Step 5: Training the Model
o knn.fit(X_train, y_train) trains the model using the training dataset (X_train as
input features and y_train as target labels).
• Step 6: Making Predictions
o knn.predict(X_test) makes predictions on the test data based on the trained
model. The X_test is the feature set for which we want to predict the class labels.
• Step 7: Evaluating the Model
o accuracy_score(y_test, y_pred) compares the predicted labels (y_pred) with the
true labels (y_test) and calculates the accuracy.
How KNN Works:
KNN is a simple yet powerful classification algorithm:
• For each test data point:
1. It calculates the distance (usually Euclidean distance) from that point to every
other point in the training set.
2. Then, it selects the k nearest points.
3. The majority class among the k nearest neighbors is taken as the prediction for
the test data point.
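For comparison, a minimal sketch of the same workflow using scikit-learn's built-in KNeighborsClassifier, whose n_neighbors parameter plays the role of k in the from-scratch class above:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# n_neighbors=3 mirrors k=3 in the from-scratch KNNClassifier above
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)
print(f"Accuracy: {accuracy_score(y_test, clf.predict(X_test)) * 100:.2f}%")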
K-means program
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

def initialize_centroids(X, k):
    # Pick k distinct data points at random as the starting centroids
    return X[np.random.choice(X.shape[0], k, replace=False)]

def compute_distance(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

def assign_clusters(X, centroids):
    clusters = []
    for point in X:
        distances = [compute_distance(point, centroid) for centroid in centroids]
        cluster = np.argmin(distances)  # Index of the nearest centroid
        clusters.append(cluster)
    return np.array(clusters)

def update_centroids(X, clusters, k):
    new_centroids = np.zeros((k, X.shape[1]))
    for i in range(k):
        new_centroids[i] = np.mean(X[clusters == i], axis=0)
    return new_centroids

def k_means(X, k, max_iters=100, tolerance=1e-4):
    centroids = initialize_centroids(X, k)
    for i in range(max_iters):
        clusters = assign_clusters(X, centroids)
        new_centroids = update_centroids(X, clusters, k)
        # Stop once the centroids move by less than the tolerance
        if np.all(np.abs(new_centroids - centroids) < tolerance):
            print(f"Converged at iteration {i}")
            break
        centroids = new_centroids
    return centroids, clusters

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
k = 4
centroids, clusters = k_means(X, k)

plt.scatter(X[:, 0], X[:, 1], c=clusters, cmap='viridis')
plt.scatter(centroids[:, 0], centroids[:, 1], s=300, c='red', marker='X')
plt.title("K-Means Clustering (from scratch)")
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
Explanation:
Step-by-step breakdown:
• Step 1: Importing Libraries
o numpy: For handling numerical data and matrix operations.
o matplotlib.pyplot: For visualizing the data points and clusters.
o make_blobs: A function to generate synthetic data with a specified number of clusters.
• Step 2: Generate Data
o make_blobs() generates synthetic data with 300 samples, 4 centers (clusters), and a fixed spread (cluster_std=0.60). This simulates a real-world clustering problem.
o X holds the generated data points; the true labels returned by make_blobs are discarded (assigned to _), since K-Means is an unsupervised learning algorithm and does not use them.
• Step 3: Running K-Means from Scratch
o k_means(X, k) implements the algorithm directly: initialize_centroids() picks k random data points as the starting centroids, assign_clusters() assigns every point to its nearest centroid, and update_centroids() recomputes each centroid as the mean of the points assigned to it.
• Step 4: Checking Convergence
o The loop repeats the assignment and update steps for up to max_iters iterations and stops early once the centroids move by less than the tolerance, i.e. when the algorithm has converged.
• Step 5: Get Cluster Centroids and Labels
o centroids: the first array returned by k_means() gives the coordinates of the centroids (the centers) of each of the 4 clusters.
o clusters: the second array returned by k_means() gives the predicted labels (cluster assignments) for each data point. Each data point is assigned the index of the cluster it belongs to.
• Step 6: Visualize Clusters
o The plt.scatter() calls visualize the clusters by coloring each data point according to its assigned cluster, with the centroids highlighted in red using an 'X' marker.
o This plot helps us visually confirm the clusters formed by the K-Means algorithm.
How K-Means Works:
• Initialization:
o K-Means starts by randomly initializing k centroids (cluster centers).
• Iteration:
1. Assigning Labels: For each data point, it computes the distance from the point to each
centroid and assigns the point to the nearest centroid (i.e., the cluster).
2. Recalculating Centroids: After assigning labels to all points, it recalculates the centroids
by averaging the points within each cluster.
3. Repeat: Steps 1 and 2 are repeated iteratively until the centroids no longer change (i.e.,
convergence is reached).
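For comparison, a minimal sketch using scikit-learn's KMeans, which runs the same assign/recalculate loop internally (by default it uses the smarter k-means++ initialization rather than purely random centroids):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# n_clusters=4 mirrors k=4 in the from-scratch version above
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
kmeans.fit(X)

print(kmeans.cluster_centers_)  # coordinates of the 4 centroids
print(kmeans.labels_[:10])      # cluster assignments of the first 10 points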
Naïve Bayes Program
import numpy as np
from sklearn.datasets import make_classification

class NaiveBayes:
    def __init__(self):
        self.class_probs = {}
        self.class_means = {}
        self.class_vars = {}

    def fit(self, X, y):
        # Get unique class labels
        classes = np.unique(y)
        # Prior probability P(class) for each class
        for c in classes:
            self.class_probs[c] = np.mean(y == c)
        # Per-class feature means and variances for the Gaussian likelihood
        for c in classes:
            X_c = X[y == c]
            self.class_means[c] = np.mean(X_c, axis=0)
            self.class_vars[c] = np.var(X_c, axis=0)

    def gaussian_pdf(self, x, mean, var):
        return (1 / np.sqrt(2 * np.pi * var)) * np.exp(-(x - mean) ** 2 / (2 * var))

    def predict(self, X):
        predictions = []
        for sample in X:
            class_probs = {}
            for c in self.class_probs:
                prob = np.log(self.class_probs[c])  # Log prior P(class)
                # Add the log likelihood of each feature under the class Gaussian
                for i in range(len(sample)):
                    prob += np.log(self.gaussian_pdf(sample[i],
                                                     self.class_means[c][i],
                                                     self.class_vars[c][i]))
                class_probs[c] = prob
            predicted_class = max(class_probs, key=class_probs.get)
            predictions.append(predicted_class)
        return np.array(predictions)

# n_informative and n_redundant are set explicitly so the 2 requested features are valid
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, n_classes=2, random_state=42)

nb = NaiveBayes()
nb.fit(X, y)

predictions = nb.predict(X)
accuracy = np.mean(predictions == y)
print(f"Accuracy: {accuracy * 100:.2f}%")
Explanation:
Step-by-step breakdown:
• Step 1: Importing Libraries
o numpy: Used for the numerical work: class priors, per-class means and variances, the Gaussian density, and the accuracy computation.
o make_classification: A scikit-learn helper that generates a synthetic binary classification dataset to train and evaluate the classifier on.
• Step 2: Generating the Dataset
o make_classification(n_samples=200, n_features=2, n_classes=2, ...) creates 200 samples, each with 2 features (X) and a binary target label (y).
• Step 3: Initializing the Naive Bayes Model
o NaiveBayes() initializes the from-scratch classifier, which assumes the features within each class are normally distributed (Gaussian distribution).
• Step 4: Training the Model
o nb.fit(X, y) estimates, for every class, the prior probability P(class) and the per-feature mean and variance used by the Gaussian likelihood.
• Step 5: Making Predictions
o nb.predict(X) computes the log posterior of every class for each sample and predicts the class with the highest score.
• Step 6: Evaluating the Model
o np.mean(predictions == y) compares the predicted labels with the true labels and reports the fraction that match. Since no train/test split is used here, this is the training accuracy.
How Naive Bayes Works:
Naive Bayes is a probabilistic classifier based on Bayes' Theorem, with the "naive" assumption
that all features are independent given the class label. It works by computing the probability of
each class given the features and predicting the class with the highest probability.
• Bayes' Theorem: P(C|X) = P(X|C) · P(C) / P(X), where:
o P(C|X) is the posterior probability of class C given the features X.
o P(X|C) is the likelihood of the features X given the class C.
o P(C) is the prior probability of class C.
o P(X) is the evidence: the probability of the features X.
In practice, Naive Bayes estimates the probability of each class by assuming that the features
are conditionally independent. For the Gaussian Naive Bayes (used here), it assumes the
features are normally distributed and uses the mean and variance of each feature to calculate
the likelihood.
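For comparison, a minimal sketch of the same idea with scikit-learn's built-in GaussianNB, which likewise estimates a per-class mean and variance for every feature; the data-generation parameters mirror the from-scratch example above:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

# Same kind of synthetic binary dataset as in the from-scratch example
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, n_classes=2, random_state=42)

clf = GaussianNB()
clf.fit(X, y)
accuracy = np.mean(clf.predict(X) == y)
print(f"Accuracy: {accuracy * 100:.2f}%")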