Python for Data Science IA 1 Programs

The document provides implementations and explanations of various machine learning algorithms, including Simple Linear Regression, K-Nearest Neighbors (KNN), K-Means Clustering, and Naïve Bayes. Each section includes code snippets for generating datasets, training models, making predictions, and evaluating performance, along with step-by-step breakdowns of the processes involved. Visualizations using matplotlib are also included to illustrate the results of the algorithms.


Simple linear regression

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

def generate_dataset(n_samples=100):
    np.random.seed(42)
    X = 2 * np.random.rand(n_samples, 1)
    y = 3 * X + 4 + np.random.randn(n_samples, 1)
    return X, y

class SimpleLinearRegression:
    def __init__(self):
        self.slope = None
        self.intercept = None

    def fit(self, X, y):
        X_mean = np.mean(X)
        y_mean = np.mean(y)
        # Least-squares estimates of the slope and intercept
        numerator = np.sum((X - X_mean) * (y - y_mean))
        denominator = np.sum((X - X_mean) ** 2)
        self.slope = numerator / denominator
        self.intercept = y_mean - self.slope * X_mean

    def predict(self, X):
        return self.slope * X + self.intercept

if __name__ == "__main__":
    X, y = generate_dataset()
    dataset = pd.DataFrame({
        "X": X.flatten(),
        "y": y.flatten()
    })
    print("Dataset:")
    print(dataset)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                        random_state=42)

    model = SimpleLinearRegression()
    model.fit(X_train.flatten(), y_train.flatten())

    y_pred = model.predict(X_test.flatten())
    mse = mean_squared_error(y_test, y_pred)

    print(f"Model Coefficients: Slope = {model.slope:.2f}, Intercept = {model.intercept:.2f}")
    print(f"Mean Squared Error on Test Set: {mse:.2f}")

    plt.scatter(X, y, color="blue", label="Actual Data")
    plt.plot(X, model.predict(X.flatten()), color="red", label="Regression Line")
    plt.xlabel('X')
    plt.ylabel('y')
    plt.title('Simple Linear Regression')
    plt.legend()
    plt.show()

Explanation:
Step-by-step breakdown:
 Step 1: Importing Libraries
o numpy: Used for generating synthetic data and performing
numerical operations.
o matplotlib.pyplot: Used for visualizing the data and the regression
line.
o pandas: Used to display the generated dataset as a DataFrame (the
regression model itself is the SimpleLinearRegression class defined in
the program, not a scikit-learn import).
o train_test_split: Splits the dataset into training and testing sets.

o mean_squared_error: Used to evaluate the performance of the
model by computing the mean squared error.
 Step 2: Generating Synthetic Data
o We generate synthetic data using the equation y = 3x + 4 with some
added Gaussian noise. This helps simulate real-world data where the
relationship between variables is linear but with some randomness.
o X contains the feature values (input), and y contains the target
values (output).
 Step 3: Splitting Data
o train_test_split() divides the data into training and testing sets. 80%
of the data is used for training, and 20% is used for testing.
 Step 4: Initializing the Model
o We create an instance of the SimpleLinearRegression class, which
starts with an empty slope and intercept.
 Step 5: Training the Model
o model.fit(X_train, y_train) fits the model to the training data,
learning the coefficients (slope and intercept) that best describe the
linear relationship between X and y.
 Step 6: Making Predictions
o y_pred = model.predict(X_test) predicts the target values (y_pred)
for the test data (X_test).
 Step 7: Evaluating the Model
o Mean Squared Error (MSE) is used to measure how well the
model fits the data. A lower MSE indicates a better fit.
o R-squared measures the proportion of the variance in the target
variable that is predictable from the features. A value closer to 1
indicates a good fit (it is not computed in this program; see the
sketch after this list).
 Step 8: Visualizing the Results
o We use matplotlib to plot all of the data points (X vs. y) together
with the regression line that represents the model's predictions.
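Step 7 mentions R-squared, which the program above does not compute. A minimal sketch using scikit-learn's r2_score, reusing the y_test and y_pred from the program, would be:

from sklearn.metrics import r2_score

# Proportion of the variance in y_test explained by the model (1.0 = perfect fit)
r2 = r2_score(y_test, y_pred)
print(f"R-squared on Test Set: {r2:.2f}")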

How Linear Regression Works:


Linear regression attempts to model the relationship between a dependent
variable y and an independent variable X by fitting a straight line to the data.
The relationship is described by the equation:

y = β0 + β1 · X


Where:
 y is the target variable (output),
 X is the input feature (independent variable),
 β0 is the intercept (where the line crosses the y-axis),
 β1 is the slope of the line.
The goal of the algorithm is to find the values of β0 and β1 that minimize the
difference between the predicted values y_pred and the actual values y (using a
loss function such as Mean Squared Error).
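As a quick sanity check of the closed-form estimates used in fit(), here is a small sketch on toy numbers chosen so the answer is obvious:

import numpy as np

# Toy data lying exactly on y = 2x + 1, so the fit should recover slope 2, intercept 1
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

X_mean, y_mean = X.mean(), y.mean()
slope = np.sum((X - X_mean) * (y - y_mean)) / np.sum((X - X_mean) ** 2)
intercept = y_mean - slope * X_mean
print(slope, intercept)  # 2.0 1.0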

KNN program
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

class KNNClassifier:
    def __init__(self, k=3):
        self.k = k
        self.X_train = None
        self.y_train = None

    def fit(self, X, y):
        self.X_train = X
        self.y_train = y

    def predict(self, X):
        predictions = []
        for x in X:
            # Euclidean distance from x to every training point
            distances = np.linalg.norm(self.X_train - x, axis=1)
            nearest_indices = distances.argsort()[:self.k]
            nearest_labels = self.y_train[nearest_indices]
            # Majority vote among the k nearest neighbors
            prediction = np.bincount(nearest_labels).argmax()
            predictions.append(prediction)
        return np.array(predictions)

iris = load_iris()
iris_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                       columns=iris['feature_names'] + ['target'])
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

knn = KNNClassifier(k=3)
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

print("\nPredictions:")
for i, (true_label, pred_label) in enumerate(zip(y_test, y_pred)):
    status = "Correct" if true_label == pred_label else "Incorrect"
    print(f"Test Sample {i + 1}: True Label = {true_label}, "
          f"Predicted = {pred_label}, {status}")

Explanation:
 Step 1: Importing Libraries
o numpy: Used for handling arrays and matrix operations.

o train_test_split: This function splits the dataset into training and
testing subsets.
o pandas: Used to display the Iris data as a DataFrame (the classifier
itself is the KNNClassifier class defined in the program, not a
scikit-learn import).
o load_iris: A function to load the Iris dataset, which is a classic
dataset used for classification tasks.
o accuracy_score: This function calculates the accuracy of predictions
by comparing the predicted labels to the true labels.
 Step 2: Loading the Dataset
o We use load_iris() to load the Iris dataset, which is a simple
classification dataset where the goal is to predict the type of iris
flower (Setosa, Versicolour, or Virginica) based on four features
(sepal length, sepal width, petal length, petal width).
 Step 3: Splitting the Data
o train_test_split() splits the data into a training set and a test set
(80% for training and 20% for testing in this case). It shuffles the
data and ensures that we evaluate the model on unseen data.
 Step 4: Creating the KNN Classifier
o We create an instance of KNNClassifier, setting the number of
neighbors k=3. This means that the class of a new data point will be
predicted based on the majority class among its 3 nearest neighbors.
 Step 5: Training the Model
o knn.fit(X_train, y_train) trains the model using the training dataset
(X_train as input features and y_train as target labels).
 Step 6: Making Predictions
o knn.predict(X_test) makes predictions on the test data based on the
trained model. The X_test is the feature set for which we want to
predict the class labels.
 Step 7: Evaluating the Model
o accuracy_score(y_test, y_pred) compares the predicted labels
(y_pred) with the true labels (y_test) and calculates the accuracy.
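For comparison, the same workflow can be run with scikit-learn's built-in KNeighborsClassifier; this is a sketch assuming the same X_train/X_test split as in the program above:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

clf = KNeighborsClassifier(n_neighbors=3)   # 3 nearest neighbors, like k=3 above
clf.fit(X_train, y_train)
y_pred_sklearn = clf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred_sklearn) * 100:.2f}%")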

How KNN Works:


KNN is a simple yet powerful classification algorithm:
 For each test data point:
1. It calculates the distance (usually Euclidean distance) from that
point to every other point in the training set.
2. Then, it selects the k nearest points.
3. The majority class among the k nearest neighbors is taken as the
prediction for the test data point.
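A tiny numeric sketch of the distance-and-vote steps, on hypothetical one-dimensional data chosen for illustration:

import numpy as np

X_train_toy = np.array([[1.0], [1.2], [3.0], [3.2], [3.4]])
y_train_toy = np.array([0, 0, 1, 1, 1])
x_new = np.array([2.9])

distances = np.linalg.norm(X_train_toy - x_new, axis=1)   # distance to every training point
nearest = distances.argsort()[:3]                         # indices of the 3 closest points
print(np.bincount(y_train_toy[nearest]).argmax())         # majority label -> 1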
K-means program

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

def initialize_centroids(X, k):
    # Pick k distinct data points at random as the starting centroids
    return X[np.random.choice(X.shape[0], k, replace=False)]

def compute_distance(a, b):
    # Euclidean distance between two points
    return np.sqrt(np.sum((a - b) ** 2))

def assign_clusters(X, centroids):
    clusters = []
    for point in X:
        distances = [compute_distance(point, centroid) for centroid in centroids]
        cluster = np.argmin(distances)  # Index of the nearest centroid
        clusters.append(cluster)
    return np.array(clusters)

def update_centroids(X, clusters, k):
    new_centroids = np.zeros((k, X.shape[1]))
    for i in range(k):
        # New centroid is the mean of all points assigned to cluster i
        new_centroids[i] = np.mean(X[clusters == i], axis=0)
    return new_centroids

def k_means(X, k, max_iters=100, tolerance=1e-4):
    centroids = initialize_centroids(X, k)
    for i in range(max_iters):
        clusters = assign_clusters(X, centroids)
        new_centroids = update_centroids(X, clusters, k)
        if np.all(np.abs(new_centroids - centroids) < tolerance):
            print(f"Converged at iteration {i}")
            break
        centroids = new_centroids
    return centroids, clusters

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60,
                  random_state=0)

k = 4
centroids, clusters = k_means(X, k)

plt.scatter(X[:, 0], X[:, 1], c=clusters, cmap='viridis')
plt.scatter(centroids[:, 0], centroids[:, 1], s=300, c='red', marker='X')
plt.title("K-Means Clustering (from scratch)")
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

Explanation:
Step-by-step breakdown:
 Step 1: Importing Libraries
o numpy: For handling numerical data and matrix operations.

o matplotlib.pyplot: For visualizing the data points and clusters.

o make_blobs: A function to generate synthetic data with a specified
number of clusters (the clustering itself is done by the k_means()
function defined in the program, not by scikit-learn's KMeans).
 Step 2: Generate Data
o make_blobs() generates synthetic data with 300 samples, 4 centers
(clusters), and a cluster standard deviation of 0.60. This is used to
simulate a real-world clustering problem.
o X holds the generated data points; the true cluster labels returned
by make_blobs are discarded (assigned to _), since K-Means is an
unsupervised learning algorithm and does not use them.
 Step 3: Running K-Means
o k_means(X, k) first calls initialize_centroids(), which picks k random
data points as the starting centroids.
o It then alternates between assign_clusters(), which assigns each
point to its nearest centroid, and update_centroids(), which
recomputes each centroid as the mean of the points assigned to it.
o The loop stops when the centroids move by less than the tolerance
(convergence) or when max_iters is reached.
 Step 4: Getting Cluster Centroids and Labels
o centroids: the coordinates of the centers of the 4 clusters returned
by k_means().
o clusters: the cluster assignment for each data point; each point gets
the index of the cluster it belongs to.
 Step 5: Visualizing Clusters
o The first plt.scatter() call colors each data point according to its
assigned cluster, and the second highlights the centroids in red with
an X marker.
o This plot helps us visually confirm the clusters formed by the K-
Means algorithm.
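For comparison, the same clustering can be done with scikit-learn's built-in KMeans; a minimal sketch on the same X from make_blobs would be:

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
kmeans.fit(X)

centroids_sklearn = kmeans.cluster_centers_   # coordinates of the 4 centroids
labels_sklearn = kmeans.labels_               # cluster assignment for each point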

How K-Means Works:


 Initialization:
o K-Means starts by randomly initializing k centroids (cluster centers).

 Iteration:
1. Assigning Labels: For each data point, it computes the distance from
the point to each centroid and assigns the point to the nearest
centroid (i.e., the cluster).
2. Recalculating Centroids: After assigning labels to all points, it
recalculates the centroids by averaging the points within each
cluster.
3. Repeat: Steps 1 and 2 are repeated iteratively until the centroids no
longer change (i.e., convergence is reached).
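As a small illustration of one assignment/update cycle, here is a sketch that reuses the assign_clusters() and update_centroids() functions from the program above on a tiny hand-made dataset:

import numpy as np

# Two obvious groups in 2-D
X_toy = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
centroids_toy = np.array([[1.0, 1.0], [9.0, 9.5]])   # initial guesses

clusters_toy = assign_clusters(X_toy, centroids_toy)       # -> [0, 0, 1, 1]
centroids_toy = update_centroids(X_toy, clusters_toy, 2)   # -> [[1.25, 1.5], [8.5, 8.75]]
print(clusters_toy, centroids_toy)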
Naïve Bayes Program

import numpy as np
from sklearn.datasets import make_classification

class NaiveBayes:
    def __init__(self):
        self.class_probs = {}
        self.class_means = {}
        self.class_vars = {}

    def fit(self, X, y):
        # Get unique class labels
        classes = np.unique(y)

        # Prior probability P(class) for each class
        for c in classes:
            self.class_probs[c] = np.mean(y == c)

        # Per-class mean and variance of each feature (Gaussian likelihoods)
        for c in classes:
            X_c = X[y == c]
            self.class_means[c] = np.mean(X_c, axis=0)
            self.class_vars[c] = np.var(X_c, axis=0)

    def gaussian_pdf(self, x, mean, var):
        return (1 / np.sqrt(2 * np.pi * var)) * np.exp(-(x - mean) ** 2 / (2 * var))

    def predict(self, X):
        predictions = []
        for sample in X:
            class_probs = {}
            for c in self.class_probs:
                prob = np.log(self.class_probs[c])  # Log prior P(class)
                for i in range(len(sample)):
                    # Add the log likelihood of each feature given the class
                    prob += np.log(self.gaussian_pdf(sample[i],
                                                     self.class_means[c][i],
                                                     self.class_vars[c][i]))
                class_probs[c] = prob
            # Pick the class with the highest log posterior
            predicted_class = max(class_probs, key=class_probs.get)
            predictions.append(predicted_class)

        return np.array(predictions)

# n_informative=2, n_redundant=0 so that both features can be used with n_features=2
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, n_classes=2, random_state=42)

nb = NaiveBayes()
nb.fit(X, y)

predictions = nb.predict(X)

accuracy = np.mean(predictions == y)
print(f"Accuracy: {accuracy * 100:.2f}%")

Explanation:
Step-by-step breakdown:
 Step 1: Importing Libraries
o numpy: Used for all the numerical operations (class priors, per-class
means and variances, Gaussian densities, and accuracy).
o make_classification: A scikit-learn function that generates a
synthetic binary classification dataset.
 Step 2: Generating the Dataset
o make_classification() creates 200 samples with 2 features and 2
classes; X holds the feature values and y holds the class labels
(0 or 1).
 Step 3: Initializing the Naive Bayes Model
o NaiveBayes() is a from-scratch Gaussian Naive Bayes classifier that
stores the class priors, the per-class feature means, and the
per-class feature variances.
 Step 4: Training the Model
o nb.fit(X, y) computes, for each class, its prior probability P(class)
and the mean and variance of every feature within that class.
 Step 5: Making Predictions
o nb.predict(X) computes, for each sample, the log prior plus the sum
of log Gaussian likelihoods of its features under every class, and
predicts the class with the highest log posterior.
 Step 6: Evaluating the Model
o np.mean(predictions == y) compares the predicted labels with the
true labels (y) and calculates the accuracy. Note that the program
evaluates on the same data it was trained on, since no train/test
split is used.
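For reference, scikit-learn's built-in GaussianNB implements the same model; a minimal sketch (assuming the same X and y from make_classification, and adding a train/test split so the evaluation is on unseen data) would be:

from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred) * 100:.2f}%")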

How Naive Bayes Works:


Naive Bayes is a probabilistic classifier based on Bayes' Theorem, with the
"naive" assumption that all features are independent given the class label. It
works by computing the probability of each class given the features and
predicting the class with the highest probability.

 Bayes' Theorem: P(C|X) = P(X|C) · P(C) / P(X)
Where:
o P(C|X) is the posterior probability of class C given the features X.
o P(X|C) is the likelihood of the features X given the class C.
o P(C) is the prior probability of class C.
o P(X) is the probability (evidence) of the features X.

In practice, Naive Bayes estimates the probability of each class by assuming that
the features are conditionally independent. For the Gaussian Naive Bayes (used
here), it assumes the features are normally distributed and uses the mean and
variance of each feature to calculate the likelihood.
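As a small numeric illustration of the Gaussian likelihood used above (toy numbers, not taken from the program's dataset):

import numpy as np

def gaussian_pdf(x, mean, var):
    return (1 / np.sqrt(2 * np.pi * var)) * np.exp(-(x - mean) ** 2 / (2 * var))

# A feature value of 1.0 is far more likely under a class with mean 1.2
# than under a class with mean 4.0 (same variance 0.25)
print(gaussian_pdf(1.0, 1.2, 0.25))  # ~0.74
print(gaussian_pdf(1.0, 4.0, 0.25))  # ~1.2e-08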
