UNIT 1: Machine Learning
1. Data Collection:
First, relevant data is collected or curated. This data could include examples, features, or
attributes that are important for the task at hand, such as images, text, numerical data, etc.
2. Data Preprocessing:
Before feeding the data into the algorithm, it often needs to be pre-processed. This step may
involve cleaning the data (handling missing values, outliers), transforming the data
(normalization, scaling), and splitting it into training and test sets.
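As a hedged sketch of this step, the snippet below uses scikit-learn (assumed available) on a small, hypothetical feature matrix to handle missing values, scale the features, and split the data into training and test sets.

```python
# A minimal preprocessing sketch: impute missing values, scale features,
# and split into training and test sets. The data here is hypothetical.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X = np.array([[25.0, 50000.0], [32.0, np.nan], [47.0, 81000.0], [51.0, 62000.0]])
y = np.array([0, 0, 1, 1])

X = SimpleImputer(strategy="mean").fit_transform(X)   # handle missing values
X = StandardScaler().fit_transform(X)                 # normalize/scale the features
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
```

In practice the imputer and scaler would be fitted on the training split only and then applied to the test split, so that no information leaks from the test data.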
3. Choosing a Model:
Depending on the task (e.g., classification, regression, clustering), a suitable machine
learning model is chosen. Examples include decision trees, neural networks, support vector
machines, and more advanced models like deep learning architectures.
6. Fine-tuning:
Models may be fine-tuned by adjusting hyperparameters (parameters that are not directly
learned during training, like learning rate or number of hidden layers in a neural network) to
improve performance.
7. Prediction or Inference:
Finally, the trained model is used to make predictions or decisions on new data. This process
involves applying the learned patterns to new inputs to generate outputs, such as class labels
in classification tasks or numerical values in regression tasks.
2. Data Collection:
When the problem is well-defined, we can collect the relevant data required for the model.
The data could come from various sources such as databases, APIs, or web scraping.
3. Data Preparation:
Once the problem-related data has been collected, it is a good idea to check the data thoroughly and bring it into the desired format so that the model can use it to find the hidden patterns.
4. Model Selection:
The next step is to select the appropriate machine learning algorithm that is suitable for our
problem. This step requires knowledge of the strengths and weaknesses of different
algorithms. Sometimes we use multiple models and compare their results and select the best
model as per our requirements.
6. Model Evaluation:
Once the model is trained, it can be evaluated on the test dataset to determine its accuracy and performance using different techniques, such as the classification report, F1 score, precision, recall, ROC curve, mean squared error, mean absolute error, etc.
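As a rough illustration of these metrics, the sketch below uses scikit-learn (assumed available) with small, hypothetical true and predicted labels.

```python
# Computing the evaluation metrics named above on hypothetical predictions.
from sklearn.metrics import (classification_report, f1_score, precision_score,
                             recall_score, roc_auc_score, mean_squared_error,
                             mean_absolute_error)

y_test = [0, 1, 1, 0, 1]   # hypothetical true labels
y_pred = [0, 1, 0, 0, 1]   # hypothetical model predictions

print(classification_report(y_test, y_pred))      # precision, recall, F1 per class
print("F1:", f1_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("ROC AUC:", roc_auc_score(y_test, y_pred))  # normally computed from predicted scores

# Regression-style errors on hypothetical numeric targets
y_true_reg = [3.0, 2.5, 4.0]
y_pred_reg = [2.8, 2.7, 3.6]
print("MSE:", mean_squared_error(y_true_reg, y_pred_reg))
print("MAE:", mean_absolute_error(y_true_reg, y_pred_reg))
```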
7. Model Tuning:
Based on the evaluation results, the model may need to be tuned or optimized to improve its
performance. This involves tweaking the hyperparameters of the model.
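One common way to tweak hyperparameters is a grid search with cross-validation; the sketch below is a minimal example with scikit-learn, and the parameter grid is purely illustrative.

```python
# A hedged sketch of hyperparameter tuning with grid search.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}  # illustrative grid

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)   # best hyperparameters and their CV score
```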
8. Deployment:
Once the model is trained and tuned, it can be deployed in a production environment to make
predictions on new data. This step requires integrating the model into an existing software
system or creating a new system for the model.
6. Slow Implementation
Machine learning models can provide highly accurate results, but producing them often takes a tremendous amount of time. Slow programs, data overload, and excessive resource requirements mean that accurate results can take a long time to obtain. Further, models require constant monitoring and maintenance to deliver the best output.
● Clustering: Clustering algorithms group similar data points together based on their characteristics. The goal is to identify groups, or clusters, of data points that are similar to each other, while being distinct from other groups. Some popular clustering algorithms include K-means, Hierarchical clustering, and DBSCAN (a short code sketch follows this list).
● Dimensionality reduction: Dimensionality reduction algorithms reduce the
number of input variables in a dataset while preserving as much of the original
information as possible. This is useful for reducing the complexity of a dataset and
making it easier to visualize and analyze. Some popular dimensionality reduction
algorithms include Principal Component Analysis (PCA), t-SNE, and
Autoencoders.
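To make the two ideas above concrete, here is a minimal sketch with scikit-learn (assumed available) that clusters the built-in iris data with K-means and reduces its dimensionality with PCA.

```python
# Clustering (K-means) and dimensionality reduction (PCA) on illustrative data.
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)   # group similar points into 3 clusters
labels = kmeans.fit_predict(X)

pca = PCA(n_components=2)                                  # reduce 4 features to 2 components
X_2d = pca.fit_transform(X)

print(labels[:10])                                 # cluster assignment of the first 10 points
print(X_2d.shape, pca.explained_variance_ratio_)   # how much information the 2 components keep
```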
Here are some specific areas where machine learning is being used:
● Predictive modeling: Machine learning can be used to build predictive models that
can help businesses make better decisions. For example, machine learning can be
used to predict which customers are most likely to buy a particular product, or which
patients are most likely to develop a certain disease.
● Natural language processing: Machine learning is used to build systems that can
understand and interpret human language. This is important for applications such
as voice recognition, chatbots, and language translation.
● Computer vision: Machine learning is used to build systems that can recognize and
interpret images and videos. This is important for applications such as self-driving
cars, surveillance systems, and medical imaging.
● Fraud detection: Machine learning can be used to detect fraudulent behavior in
financial transactions, online advertising, and other areas.
Overall, machine learning has become an essential tool for many businesses and industries, as
it enables them to make better use of data, improve their decision-making processes, and deliver
more personalized experiences to their customers.
2. Matrices
● Matrices are rectangular arrays of numbers, arranged in rows and columns.
● Matrices are used to represent linear transformations, systems of linear equations,
and data transformations in machine learning.
3. Scalars
● Scalars are single numerical values that have magnitude only, without direction.
● Scalars are used to scale vectors or matrices through operations like
multiplication.
C. Operations in Linear Algebra
1. Addition and Subtraction
● Addition and subtraction of vectors or matrices involve adding or subtracting
corresponding elements.
2. Scalar Multiplication
● Scalar multiplication involves multiplying each element of a vector or
matrix by a scalar.
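A small NumPy sketch of these operations (the vectors and matrices are arbitrary examples):

```python
# Element-wise addition/subtraction and scalar multiplication of vectors and matrices.
import numpy as np

v = np.array([1.0, 2.0, 3.0])
w = np.array([4.0, 5.0, 6.0])
A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])

print(v + w, v - w)        # vector addition and subtraction (corresponding elements)
print(A + B)               # matrix addition (corresponding elements)
print(2.5 * v)             # scalar multiplication scales every element of the vector
print(0.5 * A)             # scalar multiplication scales every element of the matrix
```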
Linear Transformations
Linear transformations are fundamental operations in linear algebra that involve the
transformation of vectors and matrices while preserving certain properties such as linearity and
proportionality. In the context of machine learning, linear transformations play a crucial role
in data preprocessing, feature engineering, and model training. In this section, we explore the
definition, types, and applications of linear transformations.
A. Matrix Multiplication
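As a brief, hedged illustration, the NumPy sketch below shows matrix multiplication acting as a linear transformation: a rotation matrix applied to a 2D vector, plus a matrix-matrix product.

```python
# Matrix multiplication as a linear transformation (example: a 90-degree rotation).
import numpy as np

R = np.array([[0.0, -1.0],
              [1.0,  0.0]])       # rotates 2D vectors 90 degrees counter-clockwise
x = np.array([1.0, 0.0])

print(R @ x)                      # -> [0. 1.]: the transformed vector
print(np.array([[1, 2], [3, 4]]) @ np.array([[5, 6], [7, 8]]))  # matrix-matrix product
```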
C. Determinants
The determinant of a square matrix is a scalar value that encodes various properties of the
matrix, such as its volume, orientation, and invertibility.
● Significance: The determinant is used to determine whether a matrix is invertible, to calculate the volume of the parallelepiped spanned by its column vectors, and to analyze the stability of numerical algorithms.
● Properties: The determinant satisfies several properties, including linearity,
multiplicativity, and the property that a matrix is invertible if and only if its
determinant is non-zero.
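A short NumPy sketch of these properties (the matrices are arbitrary examples):

```python
# The determinant and the invertibility property it encodes.
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
S = np.array([[1.0, 2.0],
              [2.0, 4.0]])        # rows are linearly dependent, so S is singular

print(np.linalg.det(A))           # non-zero determinant: A is invertible
print(np.linalg.det(S))           # (near) zero determinant: S is not invertible
print(np.linalg.inv(A))           # the inverse exists only because det(A) != 0
```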
2. Eigenvectors:
● Eigenvectors are non-zero vectors that are transformed by a matrix
only by a scalar factor, known as the eigenvalue.
● They represent the directions in which a linear transformation
represented by a matrix stretches or compresses space.
● Eigenvectors corresponding to distinct eigenvalues are linearly
independent and form a basis for the vector space.
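A minimal NumPy sketch of the defining property A v = λ v (the matrix is an arbitrary example):

```python
# Eigenvalues and eigenvectors: each eigenvector is only scaled by its eigenvalue.
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)   # eigenvectors are the columns

v = eigenvectors[:, 0]                         # first eigenvector
print(np.allclose(A @ v, eigenvalues[0] * v))  # True: A v = lambda v
```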
C. Linear Regression
Linear regression is a supervised learning algorithm used for modeling the relationship between
a dependent variable and one or more independent variables. Linear algebra plays a crucial role
in solving the linear regression problem efficiently through techniques such as:
1. Matrix Formulation:
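A hedged sketch of the matrix formulation: with design matrix X (a column of ones for the intercept plus the features) and targets y, the least-squares coefficients solve the normal equations (X^T X)β = X^T y. The data below is hypothetical.

```python
# Solving linear regression through its matrix (normal-equation) formulation.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])           # hypothetical feature values
y = np.array([2.1, 4.1, 6.2, 7.9])           # hypothetical targets
X = np.column_stack([np.ones_like(x), x])    # design matrix with an intercept column

beta = np.linalg.solve(X.T @ X, X.T @ y)     # solve (X^T X) beta = X^T y
print(beta)                                  # [intercept, slope]
```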
E. Neural Networks
Neural networks, especially deep learning models, heavily rely on linear algebra for model
representation, parameter optimization, and forward/backward propagation. Key linear
algebraic operations in neural networks include:
1. Matrix Multiplication:
● Performing matrix multiplication operations between input features
and weight matrices in different layers of the neural network during the
forward pass.
2. Gradient Descent:
● Computing gradients efficiently using backpropagation and updating
network parameters using gradient descent optimization algorithms,
which involve various linear algebraic operations.
3. Weight Initialization:
● Initializing network weights using techniques such as Xavier
initialization and He initialization, which rely on linear algebraic
properties for proper scaling of weight matrices.
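The sketch below ties these three points together in plain NumPy: Xavier-style initialization of the weight matrices and a forward pass built from matrix multiplications. The layer sizes and input batch are illustrative assumptions, and the backward pass is omitted for brevity.

```python
# Linear algebra inside a tiny neural network: Xavier-style initialization + forward pass.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 8, 2              # illustrative layer sizes

# Xavier/Glorot-style initialization: variance scaled by fan-in and fan-out
W1 = rng.normal(0.0, np.sqrt(2.0 / (n_in + n_hidden)), size=(n_in, n_hidden))
W2 = rng.normal(0.0, np.sqrt(2.0 / (n_hidden + n_out)), size=(n_hidden, n_out))

X = rng.normal(size=(5, n_in))               # batch of 5 hypothetical input vectors
hidden = np.maximum(0.0, X @ W1)             # matrix multiplication followed by ReLU
output = hidden @ W2                         # forward pass output
print(output.shape)                          # (5, 2)
```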
What is a Dataset?
A Dataset is a set of data grouped into a collection with which developers can work to meet their goals. In a dataset, the rows represent individual data points and the columns represent the features of the dataset. Datasets are mostly used in fields like machine learning, business, and government to gain insights, make informed decisions, or train algorithms. Datasets vary in size and complexity, and they mostly require cleaning and preprocessing to ensure data quality and suitability for analysis or modeling.
Datasets can be stored in multiple formats. The most common ones are CSV, Excel, JSON,
and zip files for large datasets such as image datasets.
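Assuming pandas (and an Excel reader) is available, loading these formats typically looks like the sketch below; the file names are hypothetical placeholders.

```python
# Loading datasets stored in the common formats mentioned above.
import pandas as pd

df_csv = pd.read_csv("data.csv")        # CSV file
df_xlsx = pd.read_excel("data.xlsx")    # Excel file
df_json = pd.read_json("data.json")     # JSON file

print(df_csv.shape)                     # rows = data points, columns = features
```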
Types of Datasets
There are various types of datasets available out there. They are:
● Numerical Dataset: They include numerical data points that can be solved with
equations. These include temperature, humidity, marks and so on.
● Categorical Dataset: These include categories such as colour, gender, occupation,
games, sports and so on.
● Web Dataset: These include datasets created by calling APIs using HTTP requests
and populating them with values for data analysis. These are mostly stored in JSON
(JavaScript Object Notation) formats.
● Time series Dataset: These include datasets collected over a period of time, for example, changes in geographical terrain over time.
● Image Dataset: It includes a dataset consisting of images. This is mostly used to
differentiate the types of diseases, heart conditions and so on.
● Ordered Dataset: These datasets contain data that are ordered in ranks, for
example, customer reviews, movie ratings and so on.
● Partitioned Dataset: These datasets have data points segregated into different
members or different partitions.
● File-Based Datasets: These datasets are stored in files, in Excel as .csv, or .xlsx
files.
● Bivariate Dataset: In this dataset, 2 classes or features are directly correlated to
each other. For example, height and weight in a dataset are directly related to each
other.
● Multivariate Dataset: In these types of datasets, as the name suggests 2 or more
classes are directly correlated to each other. For example, attendance, and
assignment grades are directly correlated to a student’s overall grade.
Data Preprocessing
Pre-processing refers to the transformations applied to our data before feeding it to the
algorithm. Data preprocessing is a technique that is used to convert the raw data into a clean
data set. In other words, whenever the data is gathered from different sources it is collected in
raw format which is not feasible for the analysis.
Need of Data Preprocessing
● For achieving better results from the applied model in Machine Learning projects, the data has to be in a proper format. Some Machine Learning models need information in a specific format; for example, the Random Forest algorithm does not support null values, so null values have to be managed in the original raw dataset before the algorithm can be executed (see the sketch after this list).
● Another aspect is that the dataset should be formatted in such a way that more than one Machine Learning or Deep Learning algorithm can be executed on the same dataset, and the best of them is chosen.
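As mentioned in the first point, here is a minimal sketch of managing null values before fitting a Random Forest; the small data frame is a hypothetical example and scikit-learn is assumed to be available.

```python
# Handling null values so that a Random Forest can be trained on the data.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

df = pd.DataFrame({"age": [25, np.nan, 47, 51],
                   "income": [50, 62, np.nan, 81],
                   "label": [0, 0, 1, 1]})            # hypothetical raw data with nulls

X = SimpleImputer(strategy="mean").fit_transform(df[["age", "income"]])  # replace NaNs with column means
y = df["label"]

model = RandomForestClassifier(random_state=0).fit(X, y)   # trains now that nulls are managed
print(model.predict(X[:2]))
```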
What is Bias?
Bias is simply defined as the inability of the model to capture the true relationship in the data, because of which there is some difference, or error, between the model's predicted value and the actual value. These differences between actual or expected values and the predicted values are known as bias error, or error due to bias. Bias is a systematic error that occurs due to wrong assumptions in the machine learning process.
● Low Bias: Low bias value means fewer assumptions are taken to build the target
function. In this case, the model will closely match the training dataset.
● High Bias: High bias value means more assumptions are taken to build the target
function. In this case, the model will not match the training dataset closely.
The high-bias model will not be able to capture the dataset trend. It is considered as the
underfitting model which has a high error rate. It is due to a very simplified algorithm.
For example, a linear regression model may have a high bias if the data has a non-linear
relationship.
What is Variance?
Variance is the measure of spread in data from its mean position. In machine learning, variance is the amount by which the performance of a predictive model changes when it is trained on different subsets of the training data. More specifically, variance describes how sensitive the model is to a different subset of the training dataset, i.e. how much it adjusts when trained on a new subset of the training data.
● Low variance: Low variance means that the model is less sensitive to changes in the training data and can produce consistent estimates of the target function with different subsets of data from the same distribution. Combined with high bias, this is the case of underfitting, where the model fails to generalize on both training and test data.
● High variance: High variance means that the model is very sensitive to changes in the training data and can result in significant changes in the estimate of the target function when trained on different subsets of data from the same distribution. This is the case of overfitting, when the model performs well on the training data but poorly on new, unseen test data. It fits the training data so closely that it fails on new, unseen data.
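To see both effects in one place, the illustrative sketch below fits a degree-1 (high-bias) and a degree-15 (high-variance) polynomial to the same noisy non-linear data; scikit-learn is assumed and the data is synthetic.

```python
# High bias (underfitting) vs. high variance (overfitting) on synthetic non-linear data.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.1, 30)   # non-linear signal + noise

for degree in (1, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    mse = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error").mean()
    print(degree, mse)   # degree 1 underfits (high bias); degree 15 overfits (high variance)
```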
Function Approximation
Function approximation is a critical concept in reinforcement learning (RL), enabling
algorithms to generalize from limited experience to a broader set of states and actions. This
capability is essential when dealing with complex environments where the state and action
spaces are vast or continuous.
1. Handling Complexity: In many real-world problems, the state and action spaces
are too vast to enumerate or store explicitly. Function approximation allows RL
algorithms to represent value functions or policies compactly using parameterized
functions.
2. Generalization: Function approximation enables RL agents to generalize from
limited experience to unseen states and actions. This is crucial for robust
performance in environments where exhaustive exploration is impractical.
3. Efficiency: By approximating value functions or policies, RL algorithms can
operate efficiently even in high-dimensional spaces. This efficiency is essential
for scaling RL to complex tasks such as robotic control or game playing.
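A hedged sketch of the idea: below, a state value is approximated as a dot product between a weight vector and a hand-crafted feature vector, and the weights are updated with a semi-gradient TD(0) rule. The feature mapping, step size, and sample transition are illustrative assumptions, not a specific published algorithm's settings.

```python
# Linear value-function approximation with a semi-gradient TD(0) update.
import numpy as np

def features(state):
    # Hypothetical feature mapping from a scalar state to a small feature vector
    return np.array([1.0, state, state ** 2])

w = np.zeros(3)              # parameters of the approximate value function v(s) = w . x(s)
alpha, gamma = 0.1, 0.99     # step size and discount factor

def td0_update(state, reward, next_state):
    """One semi-gradient TD(0) update of the weight vector."""
    global w
    x, x_next = features(state), features(next_state)
    td_error = reward + gamma * np.dot(w, x_next) - np.dot(w, x)
    w += alpha * td_error * x          # gradient of the linear approximation is just x

td0_update(state=0.5, reward=1.0, next_state=0.6)   # illustrative transition
print(w)
```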
○ Overfitting & underfitting are the two main errors/problems in the machine learning
model, which cause poor performance in Machine Learning.
○ Overfitting occurs when the model fits more data than required, and it tries to capture
each and every datapoint fed to it. Hence it starts capturing noise and inaccurate data
from the dataset, which degrades the performance of the model.
○ An overfitted model doesn't perform accurately with the test/unseen dataset and can’t
generalize well.
○ An overfitted model is said to have low bias and high variance.
Now, if the model performs well with the training dataset but not with the test dataset, then it
is likely to have an overfitting issue.
For example, if the model shows 85% accuracy with training data and 50% accuracy with the
test dataset, it means the model is not performing well.
Early Stopping
In this technique, training is paused before the model starts learning the noise within the training data. In this process, while training the model iteratively, the performance of the model is measured after each iteration. Training continues only as long as each new iteration keeps improving the performance of the model.
After that point, the model begins to overfit the training data; hence we need to stop the process
before the learner passes that point.
Stopping the training process before the model starts capturing noise from the data is known
as early stopping.
However, this technique may lead to the underfitting problem if training is paused too early.
So, it is very important to find that "sweet spot" between underfitting and overfitting.
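A minimal sketch of early stopping using scikit-learn's built-in support (the dataset is synthetic and the patience value is illustrative): training stops once the validation score has not improved for a set number of iterations.

```python
# Early stopping: hold out part of the training data and stop when it stops improving.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

model = MLPClassifier(hidden_layer_sizes=(32,),
                      early_stopping=True,      # monitor performance on a validation split
                      validation_fraction=0.1,  # 10% of the training data held out
                      n_iter_no_change=10,      # patience: the "sweet spot" tolerance
                      max_iter=500,
                      random_state=0)
model.fit(X, y)
print(model.n_iter_)                            # iterations actually run before stopping
```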
Training with More Data
Increasing the training set by including more data can enhance the accuracy of the model, as it provides more chances to discover the relationship between input and output variables.
It may not always prevent overfitting, but it helps the algorithm detect the signal better and minimise the errors.
When a model is fed with more training data, it becomes unable to overfit all the samples of data and is forced to generalise well.
But in some cases, the additional data may add more noise to the model; hence we need to be sure that the data is clean and free from inconsistencies before feeding it to the model.
Feature Selection
While building the ML model, we have a number of parameters or features that are used to predict the outcome. However, sometimes some of these features are redundant or less important for the prediction, and for such cases the feature selection process is applied. In the feature selection process, we identify the most important features within the training data, and the other features are removed. Further, this process helps to simplify the model and reduces noise from the data. Some algorithms have built-in automatic feature selection; if not, we can perform this process manually.
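A short sketch of automatic feature selection with scikit-learn (synthetic data, illustrative choice of k): only the k features most related to the target are kept.

```python
# Keep the 4 most important features and remove the rest.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=0)

selector = SelectKBest(score_func=f_classif, k=4)   # score features against the target
X_selected = selector.fit_transform(X, y)
print(X.shape, "->", X_selected.shape)              # (200, 10) -> (200, 4)
```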
Cross-Validation
Cross-validation is one of the powerful techniques to prevent overfitting.
In the general k-fold cross-validation technique, we divide the dataset into k equal-sized subsets of data; these subsets are known as folds. The model is then trained on k-1 folds and validated on the remaining fold, and this process is repeated k times so that every fold serves once as the validation set.
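A brief sketch of k-fold cross-validation with scikit-learn (k = 5, illustrative model and data):

```python
# 5-fold cross-validation: train on 4 folds, validate on the held-out fold, repeat.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())   # one score per fold, plus the average
```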
Data Augmentation
Data Augmentation is a data analysis technique, which is an alternative to adding more data to
prevent overfitting. In this technique, instead of adding more training data, slightly modified
copies of already existing data are added to the dataset.
The data augmentation technique makes a data sample appear slightly different every time it is processed by the model. Hence each sample appears unique to the model, which helps prevent overfitting.
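A simple sketch of the idea on image-like arrays (the images and transformations are illustrative): slightly modified copies are added to the training set.

```python
# Data augmentation: add flipped and noise-jittered copies of the original samples.
import numpy as np

rng = np.random.default_rng(0)
images = rng.random((10, 28, 28))                    # hypothetical grayscale images

flipped = images[:, :, ::-1]                         # horizontal flips
noisy = np.clip(images + rng.normal(0, 0.05, images.shape), 0.0, 1.0)  # small noise

augmented = np.concatenate([images, flipped, noisy]) # 3x as many training samples
print(augmented.shape)                               # (30, 28, 28)
```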
Regularization
If overfitting occurs because a model is complex, we can reduce the number of features. However, overfitting may also occur with a simpler model, more specifically a linear model, and for such cases regularization techniques are very helpful.
Regularization is the most popular technique to prevent overfitting. It is a group of methods
that forces the learning algorithms to make a model simpler. Applying the regularization
technique may slightly increase the bias but slightly reduces the variance. In this technique, we
modify the objective function by adding the penalizing term, which has a higher value with a
more complex model.
The two commonly used regularization techniques are L1 Regularization and L2
Regularization.
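A compact sketch of L1 and L2 regularization for a linear model with scikit-learn (synthetic data, illustrative penalty strength): the penalty term shrinks the coefficients and makes the fitted model simpler.

```python
# L1 (Lasso) and L2 (Ridge) regularization compared with plain least squares.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=10, random_state=0)

ols = LinearRegression().fit(X, y)
l1 = Lasso(alpha=1.0).fit(X, y)        # L1: can drive some coefficients exactly to zero
l2 = Ridge(alpha=1.0).fit(X, y)        # L2: shrinks all coefficients toward zero

print(abs(ols.coef_).max(), abs(l1.coef_).max(), abs(l2.coef_).max())
```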
Ensemble Methods
In ensemble methods, prediction from different machine learning models is combined to
identify the most popular result.
The most commonly used ensemble methods are Bagging and Boosting.
In bagging, individual data points can be selected more than once. After several sample datasets are collected, a model is trained independently on each one and, depending on the type of task (regression or classification), the average or majority vote of those predictions is used to produce a more accurate result. Moreover, bagging reduces the chances of overfitting in complex models.
In boosting, a large number of weak learners arranged in a sequence are trained in such a way
that each learner in the sequence learns from the mistakes of the learner before it. It combines
all the weak learners to come out with one strong learner. In addition, it improves the predictive
flexibility of simple models.
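A short sketch of both ensemble methods with scikit-learn (synthetic data, illustrative settings): bagging trains independent models on bootstrap samples, while boosting trains weak learners in sequence.

```python
# Bagging vs. boosting on a synthetic classification task.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
boosting = GradientBoostingClassifier(n_estimators=100, random_state=0)

print("bagging:", cross_val_score(bagging, X, y, cv=5).mean())
print("boosting:", cross_val_score(boosting, X, y, cv=5).mean())
```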