Machine Learning
Machine learning is a way for computers to learn without being explicitly programmed.
Imagine you show a computer a bunch of pictures of cats and dogs. With machine learning, the
computer can learn to identify cats and dogs in new pictures on its own, without needing you to
tell it exactly what to look for.
Regression: deals with continuous values. Imagine predicting the price of a house based on size, location, and other factors. The output, house price, can be any number on a continuous scale.
Classification: deals with discrete categories. Think about sorting emails into "spam" or "not spam". The output here is a category, not a specific value.
Both Regression and Classification algorithms are known as Supervised Learning algorithms: they are used to make predictions in Machine Learning and work with labeled datasets. Where they diverge is in how they approach a Machine Learning problem.
The Regression algorithm's task is to find the mapping function that maps the input variable x to the continuous output variable y.
Classification in Machine Learning Explained
On the other hand, Classification is an algorithm that finds functions that help divide the dataset into classes based on various parameters. When using a Classification algorithm, a computer program is trained on the training dataset and categorizes the data into various categories depending on what it learned. Classification algorithms find the mapping function that maps the input x to a discrete output y. The algorithms estimate discrete values (in other words, binary values such as 0 and 1, yes and no, or true and false) based on a particular set of independent variables. Put more simply, some classification algorithms, such as logistic regression, predict the probability of an event occurring by fitting data to a logit function. Classification algorithms are used for things like email spam classification, predicting the willingness of bank customers to repay their loans, and identifying cancer tumor cells.
Types of Classification
K-Nearest Neighbors (KNN) is a machine learning algorithm that works by finding the
closest data points to a new data point, like finding your closest neighbors in a crowded
room.
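For instance, here is a minimal KNN sketch with scikit-learn; the tiny two-feature dataset is invented purely for illustration:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data: two features per point, two classes (values are illustrative)
X = [[1, 2], [2, 3], [3, 1], [6, 5], [7, 7], [8, 6]]
y = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3)  # consider the 3 nearest neighbors
knn.fit(X, y)
print(knn.predict([[5, 5]]))  # the new point takes the majority class of its neighbors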
Logistic Regression: This classification type isn't complex, so it can be easily adopted with minimal training. It predicts the probability of Y being associated with the X input variable.
Let’s compare logistic regression with linear regression. Linear regression is a
prediction algorithm we learned about in the Regression section. In linear regression,
we attempt to predict the student’s exact exam score. This generates a straight line of
best fit to model the data points.
With logistic regression, we attempt to predict a class label: whether the student will succeed or fail on their exam. Here, the line of best fit is an S-shaped curve, also known as a Sigmoid curve.
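As a minimal sketch of this idea, assuming a made-up hours-studied vs. pass/fail dataset:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied and whether the student passed (1) or failed (0)
hours = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression()
clf.fit(hours, passed)
# predict_proba follows the Sigmoid curve: probabilities of [fail, pass]
print(clf.predict_proba([[4.5]]))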
Dataset
What is a Dataset?
There’s also often confusion between the terms dataset and database. While a database and
a dataset are both related terms used to describe the organization and management of data,
they differ in several meaningful ways:
As defined in the first section, a dataset is a collection of data used for analysis and modeling, typically organized in a structured format. A database, by contrast, is a system designed to store, manage, and query data, often serving many applications at once.
Training model
Training data is the initial data used to train machine learning models. Training datasets are fed to machine learning algorithms so that they can learn to make predictions or perform a desired task. This type of data is key, because it helps machines achieve results and work in the right way, as described below.
The Model: This is the computer program that learns to perform a specific task, like
recognizing faces in images or predicting house prices.
The Training Data: This is the data used to teach the model. It's like the textbooks, practice problems, and lectures a student uses to learn. The training data typically includes examples of inputs and their corresponding desired outputs.
Linear Regression
1. Variables:
○ Dependent Variable (Y): This is the variable you're trying to predict. In a
house price example, the price of the house would be the dependent
variable.
○ Independent Variable(s) (X): These are the variables you believe influence
the dependent variable. In our example, factors like size and location are
the independent variables.
2. Linear Equation: This equation represents the best-fit line through the data
points. It typically follows the format:
○ Y = a + bx
■ Y: Predicted value of the dependent variable
■ a: Intercept (the point where the line crosses the Y-axis)
■ b: Slope (the direction and steepness of the line)
■ x: Value of the independent variable
3. Error Term (ε): This term represents the difference between the actual value of
the dependent variable (Y) and the value predicted by the equation (Ŷ). It
captures the random variations or noise in the data that the linear model doesn't
account for.
4. Model Fitting: This stage involves finding the values for the intercept (a) and
slope (b) in the equation that minimize the sum of squared errors (Σε^2). This
ensures the best possible fit of the line to the data points.
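To make the fitting step concrete, here is a minimal sketch that finds a and b by least squares with NumPy; the house-size and price values are invented for illustration:

import numpy as np

# Hypothetical data: house sizes (square feet) and prices
x = np.array([1000, 1500, 2000, 2500, 3000])
y = np.array([200000, 290000, 410000, 500000, 610000])

# A degree-1 polynomial fit finds the slope b and intercept a
# that minimize the sum of squared errors (Σε^2)
b, a = np.polyfit(x, y, 1)
print(f"Y = {a:.2f} + {b:.2f}x")
print("Predicted price for 1800 sq ft:", a + b * 1800)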
Scatter plots
Scatter plots are graphs that present the relationship between two variables in a dataset. They represent data points on a two-dimensional plane, or Cartesian system. The independent variable or attribute is plotted on the X-axis, while the dependent variable is plotted on the Y-axis. These plots are often called scatter graphs or scatter diagrams.
Let us understand how to construct a scatter plot with the help of the below example.
Question:
Draw a scatter plot for the given data that shows the number of games played and
scores obtained in each instance.
Solution:
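The data table for this example is not reproduced here, so the games/scores values below are assumed for illustration; a minimal matplotlib sketch:

import matplotlib.pyplot as plt

# Assumed sample data: number of games played and scores obtained
games_played = [1, 2, 3, 4, 5, 6]
scores = [10, 18, 24, 35, 42, 50]

plt.scatter(games_played, scores)
plt.xlabel('Number of games played')  # independent variable on the X-axis
plt.ylabel('Score obtained')          # dependent variable on the Y-axis
plt.title('Games Played vs. Scores Obtained')
plt.show()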
For data variables such as x1, x2, x3, ..., xn, the scatter plot matrix presents all the pairwise scatter plots of the variables in a single illustration, arranged in a matrix format. For n variables, the scatterplot matrix will contain n rows and n columns. A plot of variables xi vs. xj will be located at the intersection of the ith row and jth column. Each row and column represents one dimension, and each cell plots a scatter plot of two dimensions.
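As a sketch, pandas provides a ready-made scatter plot matrix; the small random dataset below is assumed for illustration:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

# Illustrative data: x2 is correlated with x1, x3 is independent noise
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
df = pd.DataFrame({'x1': x1,
                   'x2': 2 * x1 + rng.normal(size=100),
                   'x3': rng.normal(size=100)})

scatter_matrix(df, figsize=(6, 6))  # an n x n grid of pairwise scatter plots
plt.show()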
We know that correlation is a statistical measure of the relationship between two variables' relative movements. If the variables are correlated, the points will fall along a line or curve; the stronger the correlation, the more closely the points will follow that line. Used to examine cause-and-effect relationships, the scatter diagram is considered one of the seven essential quality tools.
Types of correlation
The scatter plot explains the correlation between two attributes or variables. It
represents how closely the two variables are connected. There can be three such
situations to see the relation between the two variables –
1. Positive Correlation
2. Negative Correlation
3. No Correlation
Positive Correlation
When the points in the graph are rising, moving from left to right, the scatter plot shows a positive correlation. It means the values of one variable are increasing with respect to the other. Positive correlation can further be classified into three categories: perfect positive, high (strong) positive, and low (weak) positive correlation.
Negative Correlation
When the points in the scatter graph fall while moving left to right, it is called a negative correlation. It means the values of one variable are decreasing with respect to the other. Negative correlation is also of three types: perfect negative, high negative, and low negative correlation.
No Correlation
When the points are scattered all over the graph and it is difficult to conclude whether
the values are increasing or decreasing, then there is no correlation between the
variables.
Sample Data
● Define x data: use NumPy to generate an array representing the x data.
● Define y data: use NumPy to generate the corresponding y data array that has a linear relationship with the x data array.
Examine the scatter plot to verify the linear relationship between x and y, as in the sketch below.
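A minimal sketch of these steps, with the slope and noise level chosen arbitrarily:

import numpy as np
import matplotlib.pyplot as plt

# Define x data
x = np.linspace(0, 10, 50)
# Define y data with a linear relationship to x, plus a little random noise
y = 3 * x + 5 + np.random.randn(50)

# Examine the scatter plot to verify the linear relationship
plt.scatter(x, y)
plt.xlabel('x')
plt.ylabel('y')
plt.show()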
Classification and clustering are both fundamental techniques used in machine learning
to organize data, but they differ in their approach:
Classification: a supervised learning technique that assigns data points to predefined classes, learning from labeled examples.
Clustering: an unsupervised learning technique that groups similar data points together without predefined labels.
Scikit-learn
To install it, run the following command:
Bash
pip install scikit-learn
This command fetches the latest stable version of scikit-learn from the Python Package Index (PyPI).
Data preprocessing is a crucial step in machine learning, as it prepares your data for the
modeling process. It ensures the data is clean, consistent, and suitable for the
algorithms you intend to use. Here are some key reasons why data preprocessing is
important:
● Handling Missing Values: Real-world data often has gaps. Common approaches include deletion (if missing data is sparse) and imputation (filling in missing values based on statistics such as the mean or median of numerical features).
● Encoding Categorical Data: Categorical data (like text labels or colors) can be encoded into numbers so that algorithms can work with it.
● Feature Scaling: Numerical features can be scaled so that they all fall within a similar range.
● Outlier Detection and Treatment: Outliers are data points that deviate significantly from the rest of the data. They can distort model results. You can identify and handle outliers by capping their values, removing them entirely (if they are likely errors), or transforming them. A short sketch of the imputation and scaling steps follows this list.
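Here is a hedged sketch of two of these steps with scikit-learn; the tiny array is invented for illustration:

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Illustrative data with one missing value (np.nan)
X = np.array([[1.0, 200.0], [2.0, np.nan], [3.0, 600.0]])

# Imputation: fill the missing value with the column mean
X_filled = SimpleImputer(strategy='mean').fit_transform(X)

# Feature scaling: bring both columns into a similar range
X_scaled = StandardScaler().fit_transform(X_filled)
print(X_scaled)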
Choosing the right model for your task is essential. Scikit-learn offers a variety of algorithms for different problems. A general process:
1. Understand the problem: Identify the task you're trying to solve (classification, regression, clustering).
2. Explore the data: Analyze the characteristics of your data to see which algorithms are suitable.
3. Choose a model: Pick a candidate algorithm that fits the task and the data.
4. Split data: Divide your data into training and testing sets. The training set is used to train the model, and the testing set is used to evaluate its performance on unseen data.
5. Train the model: Fit the training data to the chosen algorithm, allowing the model to learn patterns from it.
Evaluation helps you assess how well your model performs on unseen data. Common metrics include accuracy for classification and mean squared error for regression. Once you evaluate the model, you can identify its strengths and weaknesses and decide what to improve.
If the initial model's performance isn't satisfactory, you can try various techniques to improve it:
● Feature Engineering: Creating new features that better capture the patterns within the data can enhance model performance. Feature engineering often requires domain knowledge.
● Model Selection and Ensemble Techniques: If the initial model choice wasn't ideal, try a different algorithm, or combine several models into an ensemble, as in the sketch below.
● Data Augmentation: Some techniques can create synthetic data points to improve model performance, especially when the original dataset is small.
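As a minimal ensemble sketch with scikit-learn's RandomForestClassifier (the synthetic dataset parameters are assumptions for illustration):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic dataset; parameters chosen only for demonstration
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# A random forest averages many decision trees, which often improves on a single model
forest = RandomForestClassifier(n_estimators=100, random_state=0)
print(cross_val_score(forest, X, y, cv=5).mean())  # mean accuracy over 5 folds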
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data (the original values were not shown): sizes as a 2-D column, prices as targets
house_sizes_reshaped = np.array([[1000], [2000], [3000]])
prices = np.array([200000, 400000, 600000])

model = LinearRegression()
model.fit(house_sizes_reshaped, prices)
prediction = model.predict([[40000]])[0]
print(prediction)
Output :
import numpy as np

# Generate five uniform random numbers in the range [0, 1)
random_numbers = np.random.rand(5)
print(random_numbers)
Output :
KMeans
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Generate synthetic blob data (parameters assumed; the original values were not shown)
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Fit KMeans and get a cluster label for each point
kmeans = KMeans(n_clusters=4, n_init=10)
labels = kmeans.fit_predict(X)

# Plot the points colored by cluster, with the centroids in red
plt.scatter(X[:, 0], X[:, 1], c=labels)
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.5)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('KMeans Clustering')
plt.show()
[Output plots: KMeans clustering results with centers = 4 and centers = 6]
K-means clustering is a popular unsupervised machine learning algorithm that
groups unlabeled data points into a specific number of clusters. Here's a breakdown of
how it works:
1. Choosing the number of clusters (k):
● K-means requires you to predefine the number of clusters (k) that you want the data to be divided into.
2. Initializing centroids:
● The algorithm starts by randomly selecting k data points from the dataset. These points serve as the initial centroids (cluster centers).
3. Assigning data points to clusters:
● K-means iteratively assigns each data point in the dataset to the closest centroid, typically measured by Euclidean distance.
4. Recomputing centroids:
● Once all data points are assigned to a cluster, the centroids are recalculated. The
new centroid becomes the mean of all the data points within that cluster.
5. Iterating until convergence:
● Steps 3 and 4 are repeated until a stopping criterion is met. This criterion can be a maximum number of iterations, or convergence (the centroids stop moving and no points change clusters).
The goal is to minimize the variation within each cluster, which means data points within a cluster should be as similar as possible to each other, and as different as possible from points in other clusters.
● Initialization matters: The initial placement of centroids can influence the final clustering results. Sometimes running the algorithm multiple times with different initializations and keeping the best result helps.
● Choosing the right k: There's no perfect way to determine the optimal value of
k. Often, it involves evaluating the clustering results for different k values and
choosing the one that best captures the inherent structure in the data.
● Limitations: K-means assumes roughly spherical clusters of similar size and may not work well for irregularly shaped clusters or clusters of very different densities.
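One common heuristic for choosing k, sketched here, is the elbow method: plot the within-cluster sum of squares (inertia) for several k values and look for the bend. The blob parameters are assumed for illustration:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Inertia (within-cluster sum of squares) for k = 1..9
inertias = [KMeans(n_clusters=k, n_init=10).fit(X).inertia_ for k in range(1, 10)]
plt.plot(range(1, 10), inertias, marker='o')
plt.xlabel('k')
plt.ylabel('Inertia')
plt.show()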
Classification
There are two types of classification: binary (choosing between two classes) and multiclass (choosing between more than two classes). In general, the two types call for different approaches, but most multiclass models will also work for binary problems.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic binary classification data (parameters assumed; the original values were not shown)
X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a logistic regression model and measure accuracy on the held-out test set
model = LogisticRegression()
model.fit(X_train, y_train)
print("Accuracy", accuracy_score(y_test, model.predict(X_test)))
Output :
Accuracy 0.855
Name   | Subject     | Mark
Pranav | Python      | 50/50
Prasad | ML          | 49/50
Rutwik | AI          | 40/50
Mayur  | Android dev | 35/50
data = [
{"Name": "Pranav", "Subject": "Python", "Mark": "50/50"},
{"Name": "Prasad", "Subject": "ML", "Mark": "49/50"},
{"Name": "Rutwik", "Subject": "AI", "Mark": "40/50"},
{"Name": "Mayur", "Subject": "Android dev", "Mark": "35/50"},
]
# Accessing data
for student in data:
    print(f"Name: {student['Name']}, Subject: {student['Subject']}, Mark: {student['Mark']}")
Convolutional Neural Networks (CNNs)
import cv2

# Load the pre-trained Haar cascade classifier for faces
face_cascade = cv2.CascadeClassifier('C:/Users/Pranav/Downloads/haarcascade_frontalface_default.xml')

# Load the image & convert the image to grayscale
image = cv2.imread("C:/Users/Pranav/Downloads/Screenshot2024-06-20145949.png")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Detect faces in the grayscale image and outline each detection
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=4)
for (x, y, w, h) in faces:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imshow('Image', image)
cv2.waitKey(0)
cv2.destroyAllWindows()
Output :
Haarcascade_frontalface_default.xml : a pre-trained cascade for detecting frontal faces.
Haarcascade_eye.xml : a pre-trained cascade for detecting eyes.