Variance
Variance is the variability in the model prediction, that is, how much the ML function can adjust depending on the given data set.
Covariance: Covariance measures the relationship between a pair of random variables, indicating how a change in one variable is associated with a change in the other.
Correlation: Correlation is a statistical measure that indicates how strongly two variables are related; it is the normalized form of covariance and always lies between -1 and 1.
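To make these three measures concrete, the following is a minimal sketch using NumPy; the two sample arrays x and y are made-up illustrative data, not taken from any real dataset.

```python
import numpy as np

# Two made-up variables for illustration only.
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.0, 3.0, 5.0, 9.0, 11.0])

# Variance: average squared deviation from the mean (ddof=1 gives the sample variance).
print("var(x):  ", np.var(x, ddof=1))

# Covariance: joint variability of x and y (off-diagonal entry of the covariance matrix).
print("cov(x,y):", np.cov(x, y)[0, 1])

# Correlation: covariance normalized to the range [-1, 1].
print("corr(x,y):", np.corrcoef(x, y)[0, 1])
```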
Eigenvalues and Eigenvectors: Eigenvalues and eigenvectors are used to represent data, to perform
operations on data, and to train machine learning models.
In artificial intelligence, eigenvalues and eigenvectors are used in algorithms for tasks such as
image recognition, natural language processing, and robotics.
Eigenvalue (λ): An eigenvalue of a square matrix A is a scalar (a single number) λ such that
there exists a non-zero vector v (the eigenvector) for which the following equation holds:
Av = λv
In other words, when you multiply the matrix A by the eigenvector v, you get a new vector
that is just a scaled version of v (scaled by the eigenvalue λ).
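A quick way to see this equation in action is to compute an eigendecomposition numerically. Below is a minimal sketch with NumPy, using an arbitrary 2x2 matrix chosen purely for illustration.

```python
import numpy as np

# An arbitrary 2x2 matrix, chosen only for illustration.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])

# np.linalg.eig returns the eigenvalues and the eigenvectors (as columns).
eigenvalues, eigenvectors = np.linalg.eig(A)

# Verify Av = λv for the first eigenpair.
v = eigenvectors[:, 0]
lam = eigenvalues[0]
print("A @ v     :", A @ v)
print("lambda * v:", lam * v)  # the same vector: v is only scaled by lambda
```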
Applications of eigenvalues and eigenvectors include the following (a PCA code sketch follows this list):
1. Dimensionality Reduction (PCA): In Principal Component Analysis (PCA), you calculate the
eigenvectors and eigenvalues of the covariance matrix of your data. The eigenvectors (principal
components) with the largest eigenvalues capture the most variance in the data and can be used to
reduce the dimensionality of the dataset while preserving important information.
2. Image Compression: Eigenvectors and eigenvalues are used in techniques like Singular Value
Decomposition (SVD) for image compression.
3. Support Vector Machines: Support vector machines (SVMs) are a type of machine learning
algorithm that can be used for classification and regression tasks. SVMs work by finding a hyperplane
that separates the data into two classes. The eigenvalues and eigenvectors of the kernel matrix of the
SVM can be used to improve the performance of the algorithm.
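As referenced in the PCA item above, here is a minimal sketch of dimensionality reduction via the eigendecomposition of a covariance matrix, assuming a small made-up data matrix X (rows = samples, columns = features).

```python
import numpy as np

# Made-up data: 6 samples, 3 features (illustrative only).
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.9],
              [2.2, 2.9, 0.4],
              [1.9, 2.2, 0.8],
              [3.1, 3.0, 0.2],
              [2.3, 2.7, 0.6]])

# 1. Center the data.
X_centered = X - X.mean(axis=0)

# 2. Covariance matrix of the features.
cov = np.cov(X_centered, rowvar=False)

# 3. Eigendecomposition (eigh is suited to symmetric matrices).
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Sort components by descending eigenvalue (most variance first).
order = np.argsort(eigenvalues)[::-1]
components = eigenvectors[:, order[:2]]   # keep the top 2 principal components

# 5. Project the data onto the principal components: 3 features -> 2.
X_reduced = X_centered @ components
print(X_reduced.shape)   # (6, 2)
```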
Different techniques, such as feature selection and feature extraction, are used to perform
dimensionality reduction.
Why is dimensionality reduction important for machine learning?
Machine learning requires large data sets to train properly, and high-dimensional data can be hard to
model. Dimensionality reduction is a particularly useful way to prevent overfitting and to make
classification and regression problems more tractable.
Association: An association rule is an unsupervised learning method used for finding relationships
between variables in a large database. It determines the sets of items that occur together in the
dataset.
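To illustrate, here is a minimal sketch that computes the support and confidence of one candidate rule over a small made-up transaction list; the items and values are purely hypothetical.

```python
# Made-up market-basket transactions (illustrative only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
    {"bread", "milk"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

# Candidate rule: {bread} -> {milk}
antecedent = {"bread"}
both = {"bread", "milk"}

sup = support(both, transactions)               # how often the items co-occur
conf = sup / support(antecedent, transactions)  # P(milk | bread)
print(f"support={sup:.2f}, confidence={conf:.2f}")
```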
Applications of Clustering
In Identification of Cancer Cells: Clustering algorithms are widely used to identify cancerous cells,
separating the cancerous and non-cancerous data points into different groups.
In Search Engines: Search engines also work on the clustering technique. Search results are returned
based on the objects closest to the search query, which is achieved by grouping similar data objects
into one group that is far from dissimilar objects. The accuracy of a query result depends on the
quality of the clustering algorithm used.
Customer Segmentation: Clustering is used in market research to segment customers based on their
choices and preferences.
In Biology: Clustering is used in biology to classify different species of plants and animals with
image recognition techniques.
In Land Use: The clustering technique is used to identify areas of similar land use in a GIS
database. This can be very useful for determining the purpose for which a particular plot of land is
best suited.
The K-Means clustering requires the analyst to define the number of clusters K, or the
number of iterations, before running the algorithm. As such, it relies heavily on the
analyst's knowledge to classify the clusters in a meaningful way.
1. The analyst selects the number of clusters K (which determines the number of initial
centroids), or the number of iterations for the algorithm to run (i.e., the stopping criterion)
2. Each data point is assigned to its closest centroid, forming the K clusters
3. With every iteration, the centroid of each cluster shifts and is updated
4. The process is repeated until the centroids no longer change for any cluster, or until
the number of iterations is fulfilled
K-Means is an iterative algorithm that divides the unlabeled dataset into k different clusters in
such a way that each data point belongs to only one group of points with similar properties.
The algorithm takes the unlabeled dataset as input, divides the dataset into k
clusters, and repeats the process until it finds the best clusters. The value of k should
be predetermined in this algorithm.
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as centroids. (These can be points other than those in the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the
predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat Step-3, reassigning each data point to the new closest centroid of its cluster.
Step-6: If any reassignment occurred, go back to Step-4; otherwise, the model is ready.
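The steps above can be sketched directly in NumPy. This is a minimal illustrative implementation, assuming made-up 2-D data and k = 2; a production version would handle empty clusters and use smarter initialization (e.g., k-means++).

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up 2-D data: two loose blobs (illustrative only).
X = np.vstack([rng.normal(0, 1, (20, 2)),
               rng.normal(5, 1, (20, 2))])

k = 2                                                 # Step 1: choose K
centroids = X[rng.choice(len(X), k, replace=False)]   # Step 2: random centroids

for _ in range(100):                                  # stopping criterion: max iterations
    # Step 3: assign each point to its closest centroid.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)

    # Step 4: recompute each centroid as the mean of its cluster.
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

    # Steps 5-6: repeat until the centroids stop moving.
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print("final centroids:\n", centroids)
```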
Overall, the significance of SVMs lies in their ability to handle high-dimensional data,
their robustness to overfitting, their capability to capture non-linear relationships, and
their solid theoretical foundations, making them a valuable tool in the machine
learning toolkit.
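For completeness, here is a minimal sketch of a non-linear SVM classifier using scikit-learn's SVC with an RBF kernel; the synthetic dataset and parameter values are illustrative choices, not recommendations.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, non-linearly separable data (illustrative only).
X, y = make_moons(n_samples=200, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# The RBF kernel lets the SVM capture non-linear decision boundaries.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))
```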