
Machine learning

What is machine learning ?

Machine learning is basically a way for computers to learn without being explicitly programmed.
Imagine you show a computer a bunch of pictures of cats and dogs. With machine learning, the
computer can learn to identify cats and dogs in new pictures on its own, without needing you to
tell it exactly what to look for.

● Machines (like computers)
● Learn (improve without being explicitly programmed)
● From data (like pictures, numbers, or text)

How machine learning can learn from data

Machine learning offers a number of different ways to learn from data:

Supervised learning: it can be regarded as a “hands-on” approach, since it uses labeled data. Humans must tag, label, or annotate the data according to their criteria in order to train the model to predict the “correct”, predetermined outputs.

Unsupervised learning: it can be construed as a “broad pattern-seeking” approach, since it uses unlabeled data. Instead of predicting the correct output, models are tasked with finding patterns, similarities, and deviations that can then be applied to other data exhibiting similar behavior.

Reinforcement learning: it uses unlabeled data and involves a feedback mechanism. When the model performs a task correctly, it receives positive feedback, which strengthens the connection between the target inputs and outputs; likewise, it receives negative feedback for incorrect solutions.

Regression : deals with continuous values. Imagine predicting the price of a house
based on size, location, and other factors. The output, house price, can be any number
on a continuous scale.

Classification : deals with discrete categories. Think about sorting emails into "spam"
or "not spam". The output here is a category, not a specific value.
Both regression and classification algorithms are supervised learning algorithms: they are used to make predictions in machine learning and work with labeled datasets. Where they diverge is in how they approach the problem, namely the kind of output they predict.

Regression in Machine Learning Explained

Regression finds correlations between dependent and independent variables. Regression algorithms therefore help predict continuous variables such as house prices, market trends, weather patterns, oil and gas prices (a critical task these days!), etc.

A regression algorithm’s task is to find the mapping function that maps the input variable x to the continuous output variable y.
Classification in Machine Learning Explained

Classification, on the other hand, is an algorithm that finds functions that help divide the dataset into classes based on various parameters. When using a classification algorithm, a computer program is trained on the training dataset and categorizes new data into classes depending on what it learned. Classification algorithms find the mapping function that maps the input x to a discrete output y. They estimate discrete values (for example, binary values such as 0 and 1, yes and no, or true and false) based on a particular set of independent variables. Put more concretely, a classifier such as logistic regression predicts an event’s probability of occurring by fitting the data to a logit (sigmoid) function. Classification algorithms are used for things like email spam classification, predicting whether bank customers will repay their loans, and identifying cancer tumor cells.

Types of Classification

K-Nearest Neighbors: This classification type identifies the K nearest neighbors to a given observation point. It then uses those K points to evaluate the proportion of each class among them and predicts the class with the highest proportion.

K-Nearest Neighbor (KNN)

K-Nearest Neighbors (KNN) is a machine learning algorithm that works by finding the
closest data points to a new data point, like finding your closest neighbors in a crowded
room.

Here are some key things to remember about KNN:

● It's simple and easy to understand.
● It can handle both classification and regression problems.
● It doesn't require any assumptions about the data.
● The choice of K is important. Too high and it might not capture the local patterns well; too low and it might be sensitive to noise.
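To make this concrete, here is a minimal sketch of KNN classification with scikit-learn. The tiny dataset (pet measurements) and the choice of n_neighbors=3 are invented purely for illustration.

from sklearn.neighbors import KNeighborsClassifier
import numpy as np

# Made-up toy data: [height_cm, weight_kg], labels 0 = "cat", 1 = "dog"
X = np.array([[25, 4], [23, 5], [24, 4.5], [60, 25], [65, 30], [62, 28]])
y = np.array([0, 0, 0, 1, 1, 1])

# K = 3: a new point receives the majority label of its 3 nearest neighbors
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

print(knn.predict([[26, 5]]))   # cat-like measurements, expected [0]
print(knn.predict([[58, 26]]))  # dog-like measurements, expected [1]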

Logistic Regression: This classification type is not complex, so it can be adopted easily with minimal training. It predicts the probability of the output Y being associated with the input variable X.
Let’s compare logistic regression with linear regression. Linear regression is a
prediction algorithm we learned about in the Regression section. In linear regression,
we attempt to predict the student’s exact exam score. This generates a straight line of
best fit to model the data points.

With logistic regression, we attempt to predict a class label–whether the student will
succeed or fail on their exam. Here, the line of best fit is an S-shaped curve, also known
as a Sigmoid curve.

In logistic regression, we use this S-shaped curve to predict the likelihood, or probability, of a data point belonging to the “succeed” category. Every data point is ultimately classified as either “succeed” or “fail” by comparing its probability to a threshold. For example, suppose a student’s predicted probability of success is 0.45. With the threshold set at the halfway mark, 0.5, the student would be classified as “fail”. If we lowered the threshold to, say, 0.4, the same student would be classified as “succeed”.

We could use a confusion matrix to help us determine the optimal threshold. As we discovered in the lesson Model Evaluation, a confusion matrix is a table layout used to evaluate any classification model.
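As a minimal sketch of the threshold idea, assuming an invented pass/fail dataset of hours studied: scikit-learn's LogisticRegression exposes the predicted probability through predict_proba, and the threshold comparison is applied by hand here.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented data: hours studied vs. pass (1) / fail (0)
hours = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(hours, passed)

# Predicted probability of "succeed" for a student who studied 4.3 hours
p = model.predict_proba([[4.3]])[0, 1]
print("P(succeed) =", p)

# The final class label depends on where the threshold is set
for threshold in (0.5, 0.4):
    label = "succeed" if p >= threshold else "fail"
    print("threshold", threshold, "->", label)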

Dataset
What is a Dataset?

A dataset is a structured collection of data organized and stored together for analysis or processing. The data within a dataset is typically related in some way and taken from a single source or intended for a single project. For example, a dataset might contain a collection of business data (sales figures, customer contact information, transactions, etc.). A dataset can include many different types of data, from numerical values to text, images or audio recordings. The data within a dataset can typically be accessed individually, in combination or managed as a whole entity.

There’s also often confusion between the terms dataset and database. While a database and a dataset are both related terms used to describe the organization and management of data, they differ in several meaningful ways: as defined in the previous section, a dataset is a collection of data used for analysis and modeling, typically organized in a structured format, whereas a database is a system for storing, managing, and querying many such collections of data over time, usually through a database management system (DBMS).
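As a quick illustration, a small dataset of the business kind described above could be represented with pandas (a library assumed here for the example; the column names and values are invented):

import pandas as pd

# A tiny, invented business dataset: each row is one transaction
sales = pd.DataFrame({
    "customer": ["Asha", "Ben", "Chen"],
    "email": ["asha@example.com", "ben@example.com", "chen@example.com"],
    "amount": [120.50, 89.99, 240.00],
})

print(sales)            # the dataset managed as a whole
print(sales["amount"])  # one column accessed individually
print(sales.loc[1])     # one record accessed individually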
Training model

What is a training model ?

Training data is the initial data used to train machine learning models. Training datasets are fed to machine learning algorithms so that they can learn to make predictions or perform a desired task. This type of data is key, because it is what allows the machine to achieve useful results and work in the right way.

The Model: This is the computer program that learns to perform a specific task, like recognizing faces in images or predicting house prices.

The Training Data: This is the data used to teach the model. It's like the textbooks, practice problems, and lectures a student uses to learn. The training data typically includes examples of inputs and their corresponding desired outputs.
Linear Regression

What Is Linear Regression?

Fig. Linear Regression

Linear regression is an algorithm that provides a linear relationship between an independent variable and a dependent variable to predict the outcome of future events. It is a statistical method used in data science and machine learning for predictive analysis. Linear regression relies on a few key components to achieve its goal of understanding the relationship between variables and making predictions. Here's a breakdown of the essential components:

1. Variables:
○ Dependent Variable (Y): This is the variable you're trying to predict. In a
house price example, the price of the house would be the dependent
variable.
○ Independent Variable(s) (X): These are the variables you believe influence
the dependent variable. In our example, factors like size and location are
the independent variables.

2. Linear Equation: This equation represents the best-fit line through the data
points. It typically follows the format:
○ Y = a + bx
■ Y: Predicted value of the dependent variable
■ a: Intercept (the point where the line crosses the Y-axis)
■ b: Slope (the direction and steepness of the line)
■ x: Value of the independent variable

3. Error Term (ε): This term represents the difference between the actual value of
the dependent variable (Y) and the value predicted by the equation (Ŷ). It
captures the random variations or noise in the data that the linear model doesn't
account for.

4. Model Fitting: This stage involves finding the values for the intercept (a) and slope (b) in the equation that minimize the sum of squared errors (Σε²). This ensures the best possible fit of the line to the data points (a small code sketch of this step follows the list of assumptions below).

5. Assumptions: Linear regression makes certain assumptions about the data:
○ Linear Relationship: The relationship between the independent and dependent variables should be roughly linear.
○ Homoscedasticity: The variance of the errors should be constant across all values of the independent variable.
○ Normality: The errors should be normally distributed.
○ Independence: The errors should be independent of each other.
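To make the fitting step above concrete, here is a minimal sketch with NumPy. The house-size data is invented; the closed-form formulas and np.polyfit both find the intercept (a) and slope (b) that minimize the sum of squared errors.

import numpy as np

# Invented data: house size (sq ft) vs. price
x = np.array([1500, 1800, 2100, 2400, 2700], dtype=float)
y = np.array([210000, 245000, 270000, 310000, 335000], dtype=float)

# Closed-form least-squares estimates:
# b = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2),  a = mean_y - b * mean_x
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
print("intercept a =", a, "slope b =", b)

# The same result from NumPy's built-in degree-1 polynomial fit
b_np, a_np = np.polyfit(x, y, 1)
print("polyfit slope =", b_np, "intercept =", a_np)

# Residuals (the error term for each point) and the quantity being minimized
errors = y - (a + b * x)
print("sum of squared errors =", np.sum(errors ** 2))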

Scatter plots

What are Scatter plots ?

Scatter plots are graphs that present the relationship between two variables in a dataset. They represent data points on a two-dimensional plane, or Cartesian system. The independent variable or attribute is plotted on the X-axis, while the dependent variable is plotted on the Y-axis. These plots are often called scatter graphs or scatter diagrams.

Scatter plot Example

Let us understand how to construct a scatter plot with the help of the example below.

Question:

Draw a scatter plot for the given data that shows the number of games played and
scores obtained in each instance.
Solution:

X-axis or horizontal axis: Number of games

Y-axis or vertical axis: Scores

Now, the scatter graph will be:

For data variables such as x1, x2, x3, …, xn, the scatter plot matrix presents all the pairwise scatter plots of the variables in a single illustration, arranged in a matrix format. For n variables, the scatter plot matrix contains n rows and n columns. The plot of variable xi vs. xj is located at the intersection of the ith row and jth column. We can say that each row and column is one dimension, whereas each cell plots a scatter plot of two dimensions.

Scatter plot Correlation

We know that correlation is a statistical measure of the relationship between two variables’ relative movements. If the variables are correlated, the points will fall along a line or curve; the stronger the correlation, the more closely the points hug that line. As a cause-and-effect examination tool, the scatter diagram is considered one of the seven basic quality tools.

Types of correlation
The scatter plot explains the correlation between two attributes or variables. It
represents how closely the two variables are connected. There can be three such
situations to see the relation between the two variables –

1. Positive Correlation
2. Negative Correlation
3. No Correlation

Positive Correlation

When the points in the graph are rising, moving from left to right, then the scatter plot
shows a positive correlation. It means the values of one variable are increasing with
respect to another. Now positive correlation can further be classified into three
categories:

● Perfect Positive – the points form a perfectly straight (rising) line
● High Positive – the points lie close to one another along a rising trend
● Low Positive – the points are widely scattered but still trend upward

Negative Correlation
When the points in the scatter graph fall while moving left to right, then it is called a
negative correlation. It means the values of one variable are decreasing with respect to
another. These are also of three types:

● Perfect Negative – the points form a perfectly straight (falling) line
● High Negative – the points lie close to one another along a falling trend
● Low Negative – the points are widely scattered but still trend downward

No Correlation

When the points are scattered all over the graph and it is difficult to conclude whether
the values are increasing or decreasing, then there is no correlation between the
variables.
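The strength and direction of these relationships can also be quantified with the correlation coefficient. Here is a minimal sketch with NumPy; the three y-variables are generated just for illustration.

import numpy as np

np.random.seed(0)
x = np.random.rand(100)

pos = 2 * x + np.random.normal(0, 0.1, 100)    # rises with x  -> positive correlation
neg = -2 * x + np.random.normal(0, 0.1, 100)   # falls with x  -> negative correlation
none = np.random.rand(100)                     # unrelated to x -> no correlation

print("positive:", np.corrcoef(x, pos)[0, 1])   # close to +1
print("negative:", np.corrcoef(x, neg)[0, 1])   # close to -1
print("none:    ", np.corrcoef(x, none)[0, 1])  # close to 0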
Sample Data

● Define x Data

Use NumPy to generate an array representing the x data.

● Define y Data

Use NumPy to generate the corresponding y data array that has a linear relationship with the x data array.

● Plot the Data

Use Matplotlib to create a scatter plot visualizing the relationship between x and y.

● Review the Plot

Examine the scatter plot to verify the linear relationship between x and y (a code sketch of these steps follows below).
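Putting the four steps above into code, here is a minimal sketch with NumPy and Matplotlib; the slope, intercept, and noise level are arbitrary choices for the example.

import numpy as np
import matplotlib.pyplot as plt

# Define x data
np.random.seed(1)
x = np.linspace(0, 10, 50)

# Define y data with a roughly linear relationship to x (plus some noise)
y = 3 * x + 5 + np.random.normal(0, 2, size=x.shape)

# Plot the data as a scatter plot
plt.scatter(x, y)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Scatter plot of a linear relationship")
plt.show()

# Review the plot: the points should fall roughly along a rising straight line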

Classification vs. Clustering

Classification and clustering are both fundamental techniques used in machine learning
to organize data, but they differ in their approach:

Classification:

● Supervised Learning: Classification is a supervised learning technique. This means it requires labeled data for training. Labeled data consists of data points where each point has a pre-assigned category or class label.
● Goal: The goal of classification is to assign new data points to predefined
classes based on the patterns learned from the labeled training data.
● Example: Imagine sorting emails into "spam" or "not spam" categories. The
training data would include emails already labeled as spam or not spam. The
classification algorithm would then learn the characteristics of spam emails and
use that knowledge to classify new incoming emails.

Clustering:

● Unsupervised Learning: Clustering is an unsupervised learning technique. This means it doesn't require labeled data. The algorithm identifies inherent patterns or groupings within the data itself.
● Goal: The goal of clustering is to group data points together based on their
similarities. These groups, called clusters, are not predefined but discovered by
the algorithm.
● Example: Imagine grouping customers based on their purchasing habits. The
clustering algorithm would analyze customer data like purchase history and
demographics to identify clusters of customers with similar buying behaviors.

Scikit-learn

Scikit-learn Installation and Setup (Step 1):


The recommended method for installing scikit-learn is using pip:

Bash

pip install scikit-learn

This command fetches the latest stable version of scikit-learn from the Python Package Index (PyPI) and installs it into your active Python environment.

Importance of Data Preprocessing (Step 2):

Data preprocessing is a crucial step in machine learning, as it prepares your data for the modeling process. It ensures the data is clean, consistent, and suitable for the algorithms you intend to use. Here are some key reasons why data preprocessing is important:

● Handling Missing Values: Missing data can significantly impact model performance. Common preprocessing techniques for handling missing values include deletion (if data is sparse), imputation (filling in missing values based on statistical methods or existing data), or feature engineering (creating new features to represent missing values).

● Encoding Categorical Data: Many machine learning algorithms require numerical features. Categorical data (like text labels or colors) can be encoded using techniques like one-hot encoding or label encoding.

● Feature Scaling: If features have different scales (e.g., one in centimeters, another in meters), it can affect the performance of some algorithms. Feature scaling techniques like standardization or normalization bring all features to a similar range.

● Outlier Detection and Treatment: Outliers are data points that deviate significantly from the rest of the data. They can distort model results. You can identify and handle outliers by capping their values, removing them entirely (if justified), or transforming them using techniques like winsorization.
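Here is a minimal sketch of a few of these preprocessing steps with pandas and scikit-learn; the toy table, column names, and values are invented for the example.

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented toy data with one missing value and one categorical column
df = pd.DataFrame({
    "size_m2": [50.0, 80.0, np.nan, 120.0],
    "city": ["Pune", "Mumbai", "Pune", "Nagpur"],
})

# Handling missing values: fill the missing size with the column mean (imputation)
df["size_m2"] = SimpleImputer(strategy="mean").fit_transform(df[["size_m2"]]).ravel()

# Encoding categorical data: one 0/1 column per city (one-hot encoding)
city_encoded = OneHotEncoder().fit_transform(df[["city"]]).toarray()

# Feature scaling: standardize size_m2 to zero mean and unit variance
size_scaled = StandardScaler().fit_transform(df[["size_m2"]])

print(df["size_m2"].values)
print(city_encoded)
print(size_scaled.ravel())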

Model Selection and Training (Step 3):

Choosing the right model for your task is essential. Scikit-learn offers a variety of algorithms for different machine learning problems, including:

● Classification: Logistic regression, decision trees, support vector machines (SVMs), k-nearest neighbors (KNN), random forests, gradient boosting, etc.
● Regression: Linear regression, Ridge regression, Lasso regression, KNN regression, support vector regression (SVR), decision tree regression, etc.
● Clustering: K-means clustering, hierarchical clustering, DBSCAN, etc.

Here's a general process for model selection and training:

1. Understand the problem: Identify the task you're trying to solve (classification, regression, clustering).
2. Explore the data: Analyze the characteristics of your data to see which algorithms might be suitable.
3. Choose a model: Select an appropriate algorithm based on your understanding of the problem and the data.
4. Split data: Divide your data into training and testing sets. The training set is used to train the model, and the testing set is used to evaluate its performance on unseen data.
5. Train the model: Fit the training data to the chosen algorithm, allowing the model to learn the patterns within the data.


Model Evaluation (Step 4):

Evaluation helps you assess how well your model performs on unseen data. Common metrics for evaluation include:

● Classification: Accuracy, precision, recall, F1-score, confusion matrix
● Regression: Mean squared error (MSE), R-squared
● Clustering: Silhouette score, Calinski-Harabasz index

Once you evaluate the model, you can identify its strengths and weaknesses and determine if it meets your performance requirements.
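As a minimal sketch, the classification metrics above can be computed with sklearn.metrics; the small label arrays here are placeholders rather than the output of a real model.

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

# Placeholder true labels and predictions for a binary classifier
y_test = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("f1-score :", f1_score(y_test, y_pred))
print("confusion matrix:")
print(confusion_matrix(y_test, y_pred))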

Improving Performance (Step 5):

If the initial model's performance isn't satisfactory, you can try various techniques to improve it:

● Hyperparameter Tuning: Many algorithms have parameters that influence their behavior. Hyperparameter tuning involves adjusting these parameters to optimize model performance. Scikit-learn provides tools like GridSearchCV and RandomizedSearchCV to facilitate this process.

● Feature Engineering: Creating new features that capture relevant relationships within the data can enhance model performance. Feature engineering often requires domain knowledge about the problem you're tackling.

● Model Selection and Ensemble Techniques: If the initial model choice wasn't optimal, consider exploring different algorithms or using ensemble methods like bagging or boosting, which combine the predictions of multiple models.

● Data Augmentation (for classification): Artificial data augmentation techniques can create synthetic data points to improve model performance, especially when working with limited datasets.
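Here is a minimal sketch of hyperparameter tuning with GridSearchCV, assuming a KNN classifier on synthetic data; the parameter grid is an arbitrary example.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Try several values of K and both weighting schemes, with 5-fold cross-validation
param_grid = {"n_neighbors": [3, 5, 7, 9], "weights": ["uniform", "distance"]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validation accuracy:", search.best_score_)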


from sklearn.linear_model import LinearRegression
import numpy as np

house_sizes = np.array([1500, 2000, 2500])
prices = np.array([200000, 250000, 300000])  # corresponding prices

# Reshape to a 2D array of shape (n_samples, 1); -1 lets NumPy infer the number of rows
house_sizes_reshaped = house_sizes.reshape(-1, 1)

# Train the model
model = LinearRegression()
model.fit(house_sizes_reshaped, prices)

# Predict the price for a new house size (example)
prediction = model.predict([[40000]])[0]
print("Predicted price for 40000 sqft house:", prediction)

Output :

Generating Random Numbers

import numpy as np

# Set the random seed for reproducibility
np.random.seed(42)

# Generate random numbers
random_numbers = np.random.rand(5)

# Print the random numbers
print(random_numbers)

Output :

KMeans
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=6, cluster_std=0.60, random_state=0)
kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')

centers = kmeans.cluster_centers_
plt.scatter(centers[:,0],centers[:,1],c='red', s=200, alpha=0.5)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('KMeans Clustering')
plt.show()

Note: the blobs above are generated with centers=6 while the model uses n_clusters=4. Changing n_clusters from 4 to 6 changes the resulting grouping (compare the plots for 4 clusters and 6 clusters).
K-means clustering is a popular unsupervised machine learning algorithm that groups unlabeled data points into a specific number of clusters. Here's a breakdown of how it works:

1. Defining the number of clusters (k):
● K-means requires you to predefine the number of clusters (k) that you want the algorithm to identify in the data. This is an important step as it determines the structure of the resulting clusters.

2. Initializing centroids:
● The algorithm starts by randomly selecting k data points from the dataset. These points become the initial representatives, or centroids, of each cluster.

3. Assigning data points to clusters:
● K-means iteratively assigns each data point in the dataset to the closest centroid based on a distance metric (usually Euclidean distance).

4. Recomputing centroids:
● Once all data points are assigned to a cluster, the centroids are recalculated. The new centroid becomes the mean of all the data points within that cluster.

5. Repeating steps 3 and 4:
● Steps 3 and 4 are repeated until a stopping criterion is met. This criterion can be:
○ No data points change clusters between iterations (indicating stable clusters).
○ A maximum number of iterations is reached.


Overall Objective:
● The main objective of k-means is to minimize the within-cluster variance. This means data points within a cluster should be as similar as possible to each other, while data points from different clusters should be dissimilar.

Here are some additional points to consider about k-means clustering:

● Initialization matters: The initial placement of centroids can influence the final clustering results. Sometimes running the algorithm multiple times with different initializations can help find better cluster solutions.
● Choosing the right k: There's no perfect way to determine the optimal value of k. Often, it involves evaluating the clustering results for different k values and choosing the one that best captures the inherent structure in the data.
● Distance metrics: Euclidean distance is a common choice, but other distance metrics can be used depending on the data type and application.
● Limitations: K-means assumes spherical clusters and may not work well for data with irregular shapes or varying densities.

K-means clustering is a versatile algorithm with a wide range of applications, including customer segmentation, image compression, and anomaly detection.

Classification
There are two types of classification: binary (choosing between two classes) and multiclass
(choosing between more than two classes). In general there are different approaches to the two
types of classification, but most multiclass models will also work for binary problems.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Generating synthetic data for binary classification
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Creating a Logistic Regression model
model = LogisticRegression()

# Training the model
model.fit(X_train, y_train)

# Making predictions on the test set
y_pred = model.predict(X_test)

# Calculating accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy", accuracy)

Output :

Accuracy 0.855

Name     Subject       Mark
Pranav   Python        50/50
Prasad   ML            49/50
Rutwik   AI            40/50
Mayur    Android dev   35/50

data = [
{"Name": "Pranav", "Subject": "Python", "Mark": "50/50"},
{"Name": "Prasad", "Subject": "ML", "Mark": "49/50"},
{"Name": "Rutwik", "Subject": "AI", "Mark": "40/50"},
{"Name": "Mayur", "Subject": "Android dev", "Mark": "35/50"},
]

# Accessing data
for student in data:
    print(f"Name: {student['Name']}, Subject: {student['Subject']}, Mark: {student['Mark']}")
Convolutional Neural Networks (CNNs)

What is a Convolutional Neural Network?


Convolutional Neural Networks (CNNs) are a powerful type of deep learning neural
network architecture specifically designed for image, video, and signal processing tasks.
They excel at identifying patterns and extracting features from data with grid-like
structures, making them highly effective in computer vision applications.
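As a rough illustration only, a tiny CNN could be defined with Keras (TensorFlow is not used elsewhere in these notes, so treat this as a sketch under that assumption); the input shape and layer sizes are arbitrary choices.

from tensorflow import keras
from tensorflow.keras import layers

# A tiny CNN for 28x28 grayscale images and 10 output classes (arbitrary choices)
model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(16, kernel_size=3, activation="relu"),  # learn local patterns
    layers.MaxPooling2D(),                                # downsample feature maps
    layers.Conv2D(32, kernel_size=3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),               # class probabilities
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()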
OpenCV (Computer Vision): pip install opencv-python

Haar cascade algorithm :

import cv2

# Load the pre-trained Haar cascade classifier for faces
face_cascade = cv2.CascadeClassifier('C:/Users/Pranav/Downloads/haarcascade_frontalface_default.xml')

# Load the image and convert it to grayscale
image = cv2.imread("C:/Users/Pranav/Downloads/Screenshot2024-06-20145949.png")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Detect faces in the image
faces = face_cascade.detectMultiScale(gray, 1.1, 4)

# Draw a rectangle around each detected face
for (x, y, w, h) in faces:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imshow('Image', image)
cv2.waitKey(0)
cv2.destroyAllWindows()

Output :

haarcascade_frontalface_default.xml detects frontal faces; haarcascade_eye.xml can be loaded in the same way to detect eyes.
