Roll No.: 30001220007
Roll No.: 30001220025
Roll No.: 30001220047
Roll No.: 30001220037
Roll No.: 30001220045
Machine Learning
Introduction
In this blog, we will discuss the workflow of a machine learning project. This includes all the steps required to build a proper machine learning project from scratch.
We will also go over data pre-processing, data cleaning, feature exploration and feature engineering, and show the impact these steps have on the performance of a machine learning model. We will also cover a couple of pre-modelling steps that can help to improve model performance.
The Python libraries commonly used for this workflow are:
1. NumPy
2. Pandas
3. scikit-learn
4. Matplotlib
The workflow of a machine learning project consists of the following steps:
1. Gathering data
2. Data pre-processing
3. Researching the model that will be best for the type of data
4. Training and testing the model on data
5. Evaluation
1. Gathering Data
The process of gathering data depends on the type of project we desire to make. If we want to make an ML project that uses real-time data, then we can build an IoT system that uses data from different sensors. The data set can be collected from various sources such as a file, a database, a sensor and many other such sources, but the collected data cannot be used directly for performing the analysis process, as there might be a lot of missing data, extremely large values, unorganized text data or noisy data. Therefore, to solve this problem, Data Preparation is done.
We can also use some free data sets which are available on the internet. Kaggle and the UCI Machine Learning Repository are the repositories most often used for building machine learning models. Kaggle is one of the most visited websites for practicing machine learning algorithms; it also hosts competitions in which people can participate and test their knowledge of machine learning.
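As a minimal sketch of this step, the snippet below loads a downloaded data set with Pandas and takes a first look at it; the file name data.csv is a hypothetical placeholder for whatever file was gathered.

```python
import pandas as pd

# "data.csv" is a hypothetical placeholder; Kaggle and UCI data sets
# are commonly distributed as CSV files.
df = pd.read_csv("data.csv")

# A quick first look at the gathered data.
print(df.shape)         # number of rows and columns
print(df.head())        # first five rows
print(df.isna().sum())  # missing values per column
```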
Data pre-processing is one of the most important steps in machine learning. It is the step that contributes the most to building accurate machine learning models. In machine learning, there is an 80/20 rule: every data scientist should spend 80% of their time on data pre-processing and 20% of their time actually performing the analysis.
Therefore, certain steps are executed to convert the raw data into a small, clean data set; this part of the process is called data pre-processing.
Most real-world data is messy; some of these types of data are:
1. Missing data: Missing data can be found when data is not continuously created or due to technical issues in the application (e.g. an IoT system).
2. Noisy data: This type of data is also called outliers; it can occur due to human errors (a human manually gathering the data) or a technical problem with the device at the time of data collection.
3. Inconsistent data: This type of data might be collected due to human errors (mistakes with names or values) or duplication of data.
The following steps are commonly used to clean such data (a combined sketch follows this list):
1. Conversion of data: Most machine learning models can only handle numeric features, hence categorical and ordinal data must somehow be converted into numeric features.
2. Ignoring the missing values: Whenever we encounter missing data in the data set, we can remove the affected rows or columns depending on our need. This method is efficient, but it shouldn't be used if the data set has a lot of missing values.
3. Filling the missing values: Whenever we encounter missing data in the data set, we can fill in the missing values manually; most commonly the mean, median or mode (highest-frequency value) is used.
4. Machine learning: If we have some missing data, we can also predict the missing values from the existing data by training a model on the complete records.
5. Outlier detection: There may be erroneous data points in our data set that deviate drastically from the other observations. [Example: human weight = 800 kg, due to mistyping an extra 0.]
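Below is a minimal sketch of these cleaning steps on a hypothetical toy table using Pandas; the column names, values and the 300 kg cut-off are illustrative assumptions, not fixed rules.

```python
import pandas as pd

# Hypothetical toy data illustrating the cleaning steps above.
df = pd.DataFrame({
    "weight": [70.0, 65.0, None, 800.0, 72.0],  # 800 kg: a mistyped extra 0
    "gender": ["male", "female", "female", "male", None],
})

# 5. Outlier detection: a simple domain rule; weights above 300 kg are
#    treated as typing errors and dropped before any statistics are computed.
df = df[(df["weight"] < 300) | df["weight"].isna()]

# 3. Filling the missing values: the column mean is a common choice.
df["weight"] = df["weight"].fillna(df["weight"].mean())

# 1. Conversion of data: one-hot encode the categorical column so the
#    model only sees numeric features.
df = pd.get_dummies(df, columns=["gender"])

print(df)
```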
3. Researching the model that will be best for the type of data
Our main goal is to train the best-performing model possible, using the pre-processed data.
Supervised Learning:
Classification:
A classification problem is when the target variable is categorical (i.e. the output can be classified into classes: it belongs to either Class A or Class B or some other class). Some popular classification algorithms are:
1) K-Nearest Neighbors
2) Naive Bayes
3) Logistic Regression
Regression:
A regression problem is when the target variable is continuous (i.e. the output is numeric). Some popular regression algorithms are:
1) Linear Regression
2) Support Vector Regression
3) Ensemble Methods
Unsupervised Learning:
Clustering:
A set of inputs is to be divided into groups. Unlike in classification, the groups are not known beforehand, making this typically an unsupervised task. Some popular clustering algorithms are:
1) Gaussian Mixtures
2) K-Means Clustering
3) Hierarchical Clustering
4) Spectral Clustering
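As an illustrative sketch, the snippet below fits one of the classifiers listed above (logistic regression) with scikit-learn; the built-in iris data set is only a stand-in for our own pre-processed data.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# The iris data set stands in for our own pre-processed data.
X, y = load_iris(return_X_y=True)

# Logistic regression, one of the classification algorithms listed above.
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

print(model.predict(X[:5]))  # class predictions for the first five rows
```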
4. Training and testing the model on data
We train the classifier using the 'training data set', tune the parameters using the 'validation set', and then test the performance of the classifier on the unseen 'test data set'. An important point to note is that, while training the classifier, only the training and/or validation set is available. The test data set must not be used during training; it only becomes available when testing the classifier.
Training set: The training set is the material through which the computer learns how to process information. Machine learning uses algorithms to perform the training part. It is a set of data used for learning, that is, to fit the parameters of the classifier.
Validation set: A set of unseen data used to tune the parameters of the classifier.
Test set: A set of unseen data used only to assess the performance of a fully specified classifier.
Once the data is divided into these three segments, we can start the training process.
In a data set, the training set is used to build the model, while the test (or validation) set is used to validate the model that was built. Data points in the training set are excluded from the test (validation) set. Usually, a data set is divided into a training set and a validation set (some people use 'test set' instead) in each iteration, or into a training set, a validation set and a test set in each iteration.
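A minimal sketch of such a split with scikit-learn is shown below; the 60/20/20 ratios and the iris data set are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First hold out the unseen test set (20%), then split the remainder
# into training and validation sets; the ratios are illustrative.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 90 30 30
```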
We train whichever model we chose in step 3. Once the model is trained, we can use the same trained model to predict on the testing data, i.e. the unseen data. Once this is done, we can build a confusion matrix, which tells us how well our model is trained. A confusion matrix has 4 parameters: 'True Positives', 'True Negatives', 'False Positives' and 'False Negatives'. We prefer to get more values in the true positives and true negatives in order to have a more accurate model. The size of the confusion matrix depends entirely on the number of classes.
True positives: cases in which we predicted TRUE and the actual output was also TRUE.
True negatives: cases in which we predicted FALSE and the actual output was also FALSE.
False positives: cases in which we predicted TRUE but the actual output was FALSE.
False negatives: cases in which we predicted FALSE but the actual output was TRUE.
We can also find the accuracy of the model using the confusion matrix:
Accuracy = (True Positives + True Negatives) / (Total number of predictions)
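The sketch below computes a confusion matrix and the accuracy with scikit-learn; the true and predicted labels are hypothetical.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels for a binary classifier.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# For binary labels {0, 1}, rows are actual classes and columns are
# predicted classes: [[TN, FP], [FN, TP]].
print(confusion_matrix(y_true, y_pred))

# Accuracy = (TP + TN) / total number of predictions.
print(accuracy_score(y_true, y_pred))  # 0.75 here
```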
5. Evaluation
Model evaluation is an integral part of the model development process. It helps us find the model that best represents our data and estimate how well the chosen model will work in the future. To improve the model, we might tune its hyper-parameters to increase the accuracy, and also look at the confusion matrix to try to increase the number of true positives and true negatives.
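As a minimal illustration of hyper-parameter tuning, the sketch below uses scikit-learn's GridSearchCV to pick the number of neighbours for K-Nearest Neighbors; the parameter grid and data set are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try several values of k and keep the one with the best
# cross-validated accuracy.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7, 9]},
    cv=5,
)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```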
Conclusion
In this blog, we have discussed the workflow of a machine learning project, which gives us a basic idea of how the problem should be tackled.