Krishna Edx Machine Learning With Python
This is to certify that this report on “Machine Learning With Python” is submitted by the
intern in the seventh semester in partial fulfilment of the requirements
for the award of the Bachelor’s degree by
JAWAHARLAL NEHRU TECHNOLOGICAL UNIVERSITY GURAJADA
VIZIANAGARAM
in the Department of Electrical and Electronics Engineering
Submitted by
YELLE KRISHNABABU
(22NT5A0265)
DECLARATION
I hereby declare that the work reported for the course entitled “Machine Learning with Python” is an
authentic record of my own learning carried out at edX as a requirement of the seventh semester for the
award of the degree of B.Tech. (Department of Electrical and Electronics Engineering), Visakha
Institute of Engineering and Technology, under the guidance of Mr K S B Varaprasad, HoD
(EEE Department).
Y Krishnababu
CERTIFICATE OF COMPLETION
Manufacturing: Predictive maintenance, quality control, supply chain optimization.
Python for Machine Learning
Python is a popular language for machine learning due to its simplicity and the powerful libraries it
offers. Here are some key libraries; a brief example using them together follows the list:
1. NumPy:
o Purpose: Provides support for large, multi-dimensional arrays and matrices, along
with a collection of mathematical functions to operate on these arrays.
o Example: Efficiently performing mathematical operations on arrays.
2. Pandas:
o Purpose: Offers data structures and operations for manipulating numerical tables and
time series.
o Example: Data cleaning, manipulation, and analysis.
3. Matplotlib:
o Purpose: A plotting library for creating static, animated, and interactive
visualizations in Python.
o Example: Creating line plots, scatter plots, histograms, etc.
4. Scikit-learn:
o Purpose: Provides simple and efficient tools for data mining and data analysis, built
on NumPy, SciPy, and Matplotlib.
o Example: Implementing machine learning algorithms like regression, classification,
clustering, and more.
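The short sketch below shows these four libraries working together; the array values are arbitrary and purely illustrative.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# NumPy: create arrays and apply vectorized operations
x = np.arange(1, 6)
y = x * 2 + np.array([0.1, -0.2, 0.3, 0.0, 0.2])
# Pandas: wrap the data in a DataFrame for inspection and cleaning
df = pd.DataFrame({'x': x, 'y': y})
print(df.describe())
# Scikit-learn: fit a simple model on the same data
model = LinearRegression().fit(df[['x']], df['y'])
print(model.coef_, model.intercept_)
# Matplotlib: visualize the data and the fitted line
plt.scatter(df['x'], df['y'], color='blue')
plt.plot(df['x'], model.predict(df[['x']]), color='red')
plt.show()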
MODULE – 2: REGRESSION
Linear Regression
Definition: Linear regression is a statistical method used to model the relationship between a
dependent variable (target) and one or more independent variables (predictors) by fitting a linear
equation to observed data. The simplest form is the equation of a straight line:
y = mx + b
where ( y ) is the dependent variable, ( x ) is the independent variable, ( m ) is the slope, and ( b ) is
the y-intercept.
Implementation in Python:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
# Sample data
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([1, 3, 2, 5, 4])
# Create and fit the model
model = LinearRegression()
model.fit(X, y)
# Predictions
y_pred = model.predict(X)
# Plotting
plt.scatter(X, y, color='blue')
plt.plot(X, y_pred, color='red')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Linear Regression')
plt.show()
Multiple Linear Regression
Definition: Multiple linear regression extends simple linear regression by using multiple independent
variables to predict the dependent variable. The equation is:
y = b_0 + b_1x_1 + b_2x_2 + ... + b_nx_n
where ( y ) is the dependent variable, ( x_1, x_2, …, x_n ) are the independent variables, and ( b_0,
b_1, …, b_n ) are the coefficients.
Implementation in Python:
# Sample data
X = np.array([[1, 1], [2, 2], [3, 3], [4, 4], [5, 5]])
y = np.array([1, 3, 2, 5, 4])
# Create and fit the model
model = LinearRegression()
model.fit(X, y)
# Predictions
y_pred = model.predict(X)
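The fitted intercept ( b_0 ) and coefficients ( b_1, ..., b_n ) can be inspected directly; a small follow-up to the block above:
# Inspect the learned parameters
print('Intercept (b_0):', model.intercept_)
print('Coefficients (b_1, ..., b_n):', model.coef_)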
Polynomial Regression
Definition: Polynomial regression fits a polynomial equation to the data, which can capture non-
linear relationships. The equation is:
y = b_0 + b_1x + b_2x^2 + ... + b_nx^n
Implementation in Python:
from sklearn.preprocessing import PolynomialFeatures
# Sample data
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([1, 3, 2, 5, 4])
# Transform the features to polynomial features
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
# Create and fit the model
model = LinearRegression()
model.fit(X_poly, y)
# Predictions
y_pred = model.predict(X_poly)
# Plotting
plt.scatter(X, y, color='blue')
plt.plot(X, y_pred, color='red')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Polynomial Regression')
plt.show()
Evaluation Metrics
1. Mean Absolute Error (MAE): The average of the absolute differences between predicted
and actual values.
MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|
2. Mean Squared Error (MSE): The average of the squared differences between predicted and
actual values.
MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
3. R-squared (R²): A statistical measure that represents the proportion of the variance for the
dependent variable that’s explained by the independent variables.
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
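All three metrics are available in scikit-learn. The sketch below scores the simple linear regression example from earlier, assuming the y and y_pred arrays from that block are still in scope:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# Compare actual values (y) against model predictions (y_pred)
print('MAE:', mean_absolute_error(y, y_pred))
print('MSE:', mean_squared_error(y, y_pred))
print('R^2:', r2_score(y, y_pred))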
MODULE – 3: CLASSIFICATION
Logistic Regression
Definition: Logistic regression is used for binary classification problems, where the outcome is a
binary variable (0 or 1). It predicts the probability of the occurrence of an event by fitting data to a
logistic function (sigmoid function).
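The logistic (sigmoid) function maps any real-valued input to a probability between 0 and 1:
\sigma(z) = \frac{1}{1 + e^{-z}}
where ( z ) is the linear combination of the input features.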
Implementation in Python:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Sample data
X = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]])
y = np.array([0, 0, 1, 1, 1])
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and fit the model
model = LogisticRegression()
model.fit(X_train, y_train)
# Predictions
y_pred = model.predict(X_test)
# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
K-Nearest Neighbors (KNN)
Definition: KNN is an instance-based learning algorithm that classifies a data point based on the
majority class of its nearest neighbors. It is simple and effective for small datasets.
Implementation in Python:
from sklearn.neighbors import KNeighborsClassifier
# Sample data
X = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]])
y = np.array([0, 0, 1, 1, 1])
# Create and fit the model
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)
# Predictions
y_pred = model.predict(X)
print(f'Predictions: {y_pred}')
Support Vector Machines (SVM)
Definition: SVM finds the hyperplane that best separates the classes in the feature space. It is
effective in high-dimensional spaces and for cases where the number of dimensions exceeds the
number of samples.
Implementation in Python:
from sklearn.svm import SVC
# Sample data
X = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]])
y = np.array([0, 0, 1, 1, 1])
# Create and fit the model
model = SVC(kernel='linear')
model.fit(X, y)
# Predictions
y_pred = model.predict(X)
print(f'Predictions: {y_pred}')
Decision Trees and Random Forests
Decision Trees:
Definition: Decision trees split the data into subsets based on feature values, creating a tree-
like model of decisions. Each node represents a feature, each branch represents a decision
rule, and each leaf represents an outcome.
Implementation in Python:
from sklearn.tree import DecisionTreeClassifier
# Sample data
X = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]])
y = np.array([0, 0, 1, 1, 1])
# Create and fit the model
model = DecisionTreeClassifier()
model.fit(X, y)
# Predictions
y_pred = model.predict(X)
print(f'Predictions: {y_pred}')
Random Forests:
Definition: Random forests use multiple decision trees to improve accuracy and control over-
fitting. Each tree is trained on a random subset of the data, and the final prediction is made by
averaging the predictions of all trees.
Implementation in Python:
from sklearn.ensemble import RandomForestClassifier
# Sample data
X = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]])
y = np.array([0, 0, 1, 1, 1])
# Create and fit the model
model = RandomForestClassifier(n_estimators=100)
model.fit(X, y)
# Predictions
y_pred = model.predict(X)
print(f'Predictions: {y_pred}')
Evaluation Metrics
1. Accuracy: The ratio of correctly predicted instances to the total instances.
\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}
2. Precision: The ratio of correctly predicted positive observations to the total predicted
positives.
\text{Precision} = \frac{TP}{TP + FP}
where ( TP ) is true positives and ( FP ) is false positives.
3. Recall: The ratio of correctly predicted positive observations to all observations in the actual
class.
\text{Recall} = \frac{TP}{TP + FN}
where ( FN ) is false negatives.
4. F1 Score: The weighted average of Precision and Recall.
F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
5. ROC-AUC: The Area Under the Receiver Operating Characteristic Curve, which plots the
true positive rate against the false positive rate.
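These metrics are available in scikit-learn. In the sketch below, y_true, y_pred, and y_score are arbitrary illustrative values, not outputs of the earlier models:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
# Arbitrary example labels, hard predictions, and predicted probabilities of class 1
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
y_score = [0.2, 0.6, 0.8, 0.9, 0.4, 0.1]
print('Accuracy:', accuracy_score(y_true, y_pred))
print('Precision:', precision_score(y_true, y_pred))
print('Recall:', recall_score(y_true, y_pred))
print('F1:', f1_score(y_true, y_pred))
print('ROC-AUC:', roc_auc_score(y_true, y_score))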
MODULE – 4: CLUSTERING
K-Means Clustering
Definition: K-Means is a popular clustering algorithm that partitions data into ( K ) clusters. Each
data point belongs to the cluster with the nearest mean, which serves as the cluster’s centroid.
Steps:
1. Initialize ( K ) centroids randomly.
2. Assign each data point to the nearest centroid.
3. Recalculate the centroids as the mean of all points in the cluster.
4. Repeat steps 2 and 3 until the centroids no longer change significantly.
Implementation in Python:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
# Sample data
X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
# Create and fit the model
kmeans = KMeans(n_clusters=2, random_state=0)
kmeans.fit(X)
# Predictions
y_kmeans = kmeans.predict(X)
# Plotting
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75)
plt.title('K-Means Clustering')
plt.show()
Hierarchical Clustering
Definition: Hierarchical clustering builds a hierarchy of clusters using either agglomerative (bottom-
up) or divisive (top-down) approaches. Agglomerative clustering starts with each data point as a
single cluster and merges the closest pairs of clusters iteratively.
Steps:
1. Start with each data point as its own cluster.
2. Merge the two closest clusters.
3. Repeat step 2 until all points are in a single cluster.
Implementation in Python:
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
# Sample data
X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
# Perform hierarchical clustering
Z = linkage(X, 'ward')
# Plotting the dendrogram
plt.figure(figsize=(10, 7))
dendrogram(Z)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Sample index')
plt.ylabel('Distance')
plt.show()
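To obtain flat cluster labels from the same linkage matrix, the dendrogram can be cut at a chosen number of clusters; choosing 2 clusters below is an assumption for this toy data:
from scipy.cluster.hierarchy import fcluster
# Cut the dendrogram so that at most 2 clusters remain
labels = fcluster(Z, t=2, criterion='maxclust')
print('Cluster labels:', labels)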
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
Definition: DBSCAN clusters data based on density. It can find arbitrarily shaped clusters and is
robust to noise (outliers).
Steps:
1. For each point, find the points within a specified radius (epsilon).
2. If a point has more neighbors than a specified threshold (min_samples), it is a core point and
forms a cluster.
3. Expand the cluster by including all density-reachable points.
4. Repeat until all points are processed.
Implementation in Python:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
# Sample data
X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0], [10, 10]])
# Create and fit the model
dbscan = DBSCAN(eps=1.5, min_samples=2)
dbscan.fit(X)
# Predictions
y_dbscan = dbscan.labels_
# Plotting
plt.scatter(X[:, 0], X[:, 1], c=y_dbscan, s=50, cmap='viridis')
plt.title('DBSCAN Clustering')
plt.show()
Evaluation Metrics
1. Silhouette Score: Measures how similar a point is to its own cluster compared to other
clusters. Values range from -1 to 1, where a higher value indicates better clustering.
\text{Silhouette Score} = \frac{b - a}{\max(a, b)}
where ( a ) is the average distance to other points in the same cluster, and ( b ) is the average distance
to points in the nearest cluster.
2. Davies-Bouldin Index: Measures the average similarity ratio of each cluster with its most
similar cluster. Lower values indicate better clustering.
\text{DB Index} = \frac{1}{n} \sum_{i=1}^{n} \max_{j \neq i} \left( \frac{s_i + s_j}{d_{ij}} \right)
where ( s_i ) and ( s_j ) are the cluster dispersions and ( d_{ij} ) is the distance between cluster
centroids.
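Both indices are available in scikit-learn. The sketch below scores the K-Means result from earlier, assuming the X and y_kmeans arrays from that block (the DBSCAN example reuses the name X, so rerun the K-Means block first):
from sklearn.metrics import silhouette_score, davies_bouldin_score
print('Silhouette Score:', silhouette_score(X, y_kmeans))
print('Davies-Bouldin Index:', davies_bouldin_score(X, y_kmeans))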
MODULE – 5: RECOMMENDER SYSTEMS
Introduction to Recommender Systems
Recommender systems are algorithms designed to suggest relevant items to users. These items can be
anything from movies, books, and products to friends on social media. The main goal is to provide
personalized recommendations to enhance user experience.
Types of Recommender Systems:
1. Collaborative Filtering: Based on user-item interactions. It can be further divided into:
o User-Based Collaborative Filtering: Recommends items by finding similar users.
o Item-Based Collaborative Filtering: Recommends items by finding similar items.
2. Content-Based Filtering: Recommends items based on the features of the items and the
user’s past preferences. For example, if a user likes action movies, the system will
recommend other action movies.
3. Hybrid Methods: Combine collaborative and content-based filtering to leverage the strengths
of both approaches.
Collaborative Filtering
User-Based Collaborative Filtering:
Concept: Finds users who are similar to the target user and recommends items that those
similar users have liked.
Example: If User A and User B have similar tastes, and User A likes a new item, User B is
likely to like it too.
Item-Based Collaborative Filtering:
Concept: Finds items that are similar to the items the target user has liked and recommends
those similar items.
Example: If a user likes a particular movie, the system will recommend other movies that are
similar to it.
Implementation in Python:
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
# Sample user-item interaction matrix
data = {'User1': [5, 3, 0, 1],
'User2': [4, 0, 0, 1],
'User3': [1, 1, 0, 5],
'User4': [1, 0, 0, 4],
'User5': [0, 1, 5, 4]}
df = pd.DataFrame(data, index=['Item1', 'Item2', 'Item3', 'Item4'])
# Compute cosine similarity between users
user_similarity = cosine_similarity(df.T)
print(user_similarity)
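One simple way to turn this similarity matrix into a recommendation is to find the user most similar to a target user and suggest items that user has rated but the target has not. The sketch below continues from the block above; choosing 'User2' as the target is arbitrary:
users = df.columns
sim_df = pd.DataFrame(user_similarity, index=users, columns=users)
target = 'User2'
# Most similar other user (exclude the target itself)
most_similar = sim_df[target].drop(target).idxmax()
# Items the similar user has rated that the target has not rated yet (rating 0)
unseen = df.index[(df[target] == 0) & (df[most_similar] > 0)]
print(f'Recommend to {target}: {list(unseen)}')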
Content-Based Filtering
Concept: Recommends items based on the features of the items and the user’s past preferences. It
uses item metadata (e.g., genre, director, actors for movies) to make recommendations.
Implementation in Python:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
# Sample item metadata
items = ['Action movie with lots of explosions',
'Romantic comedy with a happy ending',
'Action movie with a hero saving the day',
'Drama with a deep storyline']
# Vectorize the item descriptions
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(items)
# Compute cosine similarity between items
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
print(cosine_sim)
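The resulting similarity matrix can then be used to suggest the item closest to one a user liked. A short follow-up to the block above; picking item 0 (the first action movie) is arbitrary:
# Find the item most similar to item 0, ignoring item 0 itself
liked = 0
scores = cosine_sim[liked].copy()
scores[liked] = -1
best = scores.argmax()
print('Most similar item:', items[best])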
Matrix Factorization Techniques
Singular Value Decomposition (SVD):
Concept: Decomposes the user-item interaction matrix into three matrices, capturing latent
factors that explain the observed interactions.
Example: Used in collaborative filtering to predict missing entries in the user-item matrix.
Alternating Least Squares (ALS):
Concept: An optimization algorithm used to factorize the user-item interaction matrix by
alternating between fixing user factors and item factors to minimize the error.
Example: Commonly used in large-scale recommender systems.
Implementation in Python (SVD):
from scipy.sparse.linalg import svds
# Sample user-item interaction matrix
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 0, 4],
              [0, 1, 5, 4]], dtype=float)  # svds expects floating-point values
# Perform SVD
U, sigma, Vt = svds(R, k=2)
sigma = np.diag(sigma)
# Reconstruct the matrix
R_pred = np.dot(np.dot(U, sigma), Vt)
print(R_pred)
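ALS itself is not implemented above. The following is a minimal NumPy sketch of the alternating-least-squares idea applied to the same matrix R; the rank, regularization strength, and iteration count are arbitrary choices, and zero entries are treated as missing ratings:
n_users, n_items = R.shape
k = 2                          # rank of the factorization (arbitrary choice)
lam = 0.1                      # regularization strength (arbitrary choice)
rng = np.random.default_rng(0)
P = rng.random((n_users, k))   # user factors
Q = rng.random((n_items, k))   # item factors
rated = R > 0                  # treat zeros as missing ratings
for _ in range(20):            # a fixed number of alternations
    # Fix Q and solve a regularized least-squares problem for each user vector
    for u in range(n_users):
        Qu = Q[rated[u]]
        P[u] = np.linalg.solve(Qu.T @ Qu + lam * np.eye(k), Qu.T @ R[u, rated[u]])
    # Fix P and solve for each item vector
    for i in range(n_items):
        Pi = P[rated[:, i]]
        Q[i] = np.linalg.solve(Pi.T @ Pi + lam * np.eye(k), Pi.T @ R[rated[:, i], i])
# Predicted ratings from the learned factors
print(np.round(P @ Q.T, 2))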
Evaluation Metrics
1. Precision: The ratio of correctly recommended items to the total recommended items.
\text{Precision} = \frac{TP}{TP + FP}
2. Recall: The ratio of correctly recommended items to the total relevant items.
\text{Recall} = \frac{TP}{TP + FN}
3. F1 Score: The harmonic mean of Precision and Recall.
F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
4. Mean Squared Error (MSE): The average of the squared differences between predicted and
actual ratings.
MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
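For the SVD example above, MSE can be computed over the observed (non-zero) ratings only; a sketch assuming the R and R_pred arrays from that block:
observed = R > 0
mse = np.mean((R[observed] - R_pred[observed]) ** 2)
print(f'MSE on observed ratings: {mse:.4f}')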