Unit 3: Model Construction
3.1 Machine Learning Concepts – An Overview
Machine Learning is a technology used to train machines to perform actions such as prediction, recommendation, and estimation based on historical data or past experience.
Machine Learning enables computers to mimic aspects of human learning by training them on past experience and data.
There are three key aspects of Machine Learning, which are as follows:
o Task: A task is the main problem in which we are interested. This task/problem can be related to predictions, recommendations, estimations, etc.
o Experience: Learning from historical or past data, which is then used to estimate and resolve future tasks.
o Performance: The capacity of the machine to resolve a machine learning task or problem and provide the best outcome for it. Performance depends on the type of machine learning problem.
How does machine learning work?
Machine learning commonly uses two techniques: supervised learning, which trains a model on known input and output data so that it can predict future outputs, and unsupervised learning, which finds hidden patterns or intrinsic structures in the input data.
Machine Learning techniques are divided mainly into the following four categories:
1. Supervised Learning
When to use: Use supervised learning if you have known data for the output you are trying to estimate.
Supervised learning is applicable when a machine has sample data, i.e., input as well as output data with correct labels, and those labels are used to check the correctness of the model. The supervised learning technique helps us predict future events from past experience and labeled examples. Initially, it analyses the known training dataset, and later it produces an inferred function that makes predictions about output values. It also detects errors during this learning process and corrects them through its learning algorithm.
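As a quick illustration, here is a minimal supervised learning sketch (not part of the original text; it assumes scikit-learn is installed and uses its built-in Iris dataset in place of a real business dataset):

# A minimal supervised learning sketch: train on labeled examples,
# then predict output values for unseen inputs.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Sample data: inputs X with correct labels y.
X, y = load_iris(return_X_y=True)
# Hold out part of the data to check the correctness of the model.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
# Learn an inferred function from the labeled training set.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)
# Predict outputs for unseen inputs and measure performance.
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))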
2. Unsupervised Learning
For example, if a cell phone company wants to optimize the locations where they build towers, they can use machine learning to estimate the number of clusters of people relying on their towers.
A phone can only talk to one tower at a time, so the team uses clustering algorithms to design the best placement of cell towers to optimize signal reception for groups, or clusters, of customers.
Example: Let's assume a machine is trained with a set of documents belonging to different categories (Type A, B, and C), and we have to organize them into appropriate groups. Because the machine is provided only with input samples and no output labels, it can organize these documents into Type A, Type B, and Type C clusters, but there is no guarantee that this grouping matches the true categories.
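A minimal clustering sketch (illustrative only; synthetic blobs stand in for the Type A/B/C documents, and scikit-learn is assumed):

# Unsupervised learning: KMeans groups unlabeled samples into clusters.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Unlabeled input samples only; no output labels are provided.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
# Organize the samples into three groups (e.g., Type A, B, and C).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
groups = kmeans.fit_predict(X)
print("Clusters assigned to the first five samples:", groups[:5])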
3. Reinforcement Learning
Reinforcement Learning is a feedback-based machine learning technique. In this type of learning, agents (computer programs) need to explore the environment, perform actions, and, on the basis of their actions, receive rewards as feedback. For each good action, they get a positive reward, and for each bad action, they get a negative reward. The goal of a Reinforcement Learning agent is to maximize the total reward it receives over time.
Example:
Imagine a mouse in a maze trying to find hidden pieces of cheese. At first, the mouse may move randomly, but after a while it learns which actions bring it closer to the cheese. The more times we expose the mouse to the maze, the better it gets at finding the cheese.
You can use RL when you have little or no historical data about a problem, as it does not require prior information (unlike traditional machine learning methods). In the RL framework, you learn from the data as you go. Not surprisingly, RL is particularly successful with games, especially games of "perfect information" such as chess and Go. With games, feedback from the environment comes quickly, allowing the model to learn faster. The downside of RL is that it can take a very long time to train if the problem is complex.
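The mouse-and-maze idea can be sketched with tabular Q-learning (a toy example under assumed parameters; the 1-D corridor, rewards, and variable names are all illustrative):

# A toy Q-learning sketch: a mouse in a 1-D corridor of 5 cells, with
# cheese (reward +1) in the last cell and a small penalty per step.
import random

n_states = 5                               # corridor cells 0..4; cheese at cell 4
Q = [[0.0, 0.0] for _ in range(n_states)]  # Q-values for actions: 0=left, 1=right
alpha, gamma, epsilon = 0.5, 0.9, 0.2      # learning rate, discount, exploration

for episode in range(200):
    state = 0
    while state != n_states - 1:
        # Explore sometimes; otherwise exploit the best known action.
        if random.random() < epsilon:
            a = random.randrange(2)
        else:
            a = Q[state].index(max(Q[state]))
        nxt = min(max(state + (1 if a == 1 else -1), 0), n_states - 1)
        reward = 1.0 if nxt == n_states - 1 else -0.01
        # Update the action value from the reward feedback.
        Q[state][a] += alpha * (reward + gamma * max(Q[nxt]) - Q[state][a])
        state = nxt

print("Best action per cell (0=left, 1=right):", [q.index(max(q)) for q in Q])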
4. Semi-supervised Learning
Semi-supervised learning sits between supervised and unsupervised learning: the model is trained on a small amount of labeled data together with a large amount of unlabeled data, which is useful when labeling every sample would be too expensive.
Simple Linear Regression
Simple Linear Regression models the relationship between a single independent variable x and a dependent variable y with a straight line:
y = a0 + a1x
where a0 is the intercept and a1 is the slope. The goal of the linear regression algorithm is to find the best values for a0 and a1, which give the best fit line. The best fit line should have the least error, meaning the error between the predicted values and the actual values should be minimized.
For the above linear equation, the Mean Squared Error (MSE) can be calculated using the formula below:
MSE = (1/N) Σ (yi − (a0 + a1xi))²
where N is the number of observations, yi is the actual value, and a0 + a1xi is the predicted value for input xi.
The goodness of fit determines how well the line of regression fits the set of observations. The process of finding the best model out of various models is called optimization. It can be achieved by the method below:
R-squared method: R-squared is a statistical measure of goodness of fit on a scale of 0 to 1 (often expressed as a percentage). It is calculated as
R-squared = Explained variation / Total variation
The higher the R-squared value, the smaller the difference between the predicted and actual values.
Example of SLR (adapted from https://www.javatpoint.com/simple-linear-regression-in-machine-learning)
The key point in Simple Linear Regression is that the dependent variable must be
a continuous/real value. However, the independent variable can be measured on
continuous or categorical values.
Here we are taking a dataset that has two variables: salary (dependent variable)
and experience (Independent variable). The goal of this problem is:
o We want to find out if there is any correlation between these two variables.
o We will find the best fit line for the dataset.
o We will see how the dependent variable changes as the independent variable changes.
Solution:
The first step in creating the Simple Linear Regression model is data pre-processing. We have already done it earlier in this tutorial, but there will be some changes, which are given in the steps below:
o First, we will import the three important libraries, which will help us load the dataset, plot the graphs, and create the Simple Linear Regression model.
o Next, we will load the dataset into our code:
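A sketch of these pre-processing steps (the file name Salary_Data.csv is an assumption; substitute your own dataset):

# Import the three important libraries.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Load the dataset containing experience and salary columns.
data_set = pd.read_csv('Salary_Data.csv')

# Extract the independent variable (experience) and dependent variable (salary).
x = data_set.iloc[:, :-1].values
y = data_set.iloc[:, 1].values

# Split the data into training and test sets.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=1/3, random_state=0)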
Now the second step is to fit our model to the training dataset. To do so, we will import the LinearRegression class of the linear_model library from scikit-learn and create an object of that class named regressor. We then use the fit() method to fit our Simple Linear Regression object to the training set, passing x_train and y_train, our training data for the independent and dependent variables. Fitting the regressor object to the training set lets the model learn the correlations between the predictor and target variables.
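A sketch of this fitting step (assuming the variables from the pre-processing sketch above):

# Fit Simple Linear Regression to the training set.
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(x_train, y_train)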
Next, we will create prediction vectors y_pred and x_pred, which will contain the predictions for the test dataset and the training set, respectively. On executing this step, the two variables y_pred and x_pred appear in the variable explorer, containing the salary predictions for the test set and the training set.
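A sketch of the prediction step (variable names follow the text):

# Predict salaries for the test set and the training set.
y_pred = regressor.predict(x_test)   # test-set predictions
x_pred = regressor.predict(x_train)  # training-set predictions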
Now in this step, we will visualize the training set results. To do so, we will use the scatter() function of the pyplot library, which we already imported in the pre-processing step. The scatter() function creates a scatter plot of the observations.
On the x-axis, we plot the employees' years of experience, and on the y-axis, their salaries. In the function, we pass the real values of the training set, i.e., the years of experience x_train, the training-set salaries y_train, and the color of the observations. Here we use green for the observations, but it can be any color of your choice.
Now we need to plot the regression line, so we will use the plot() function of the pyplot library. In this function, we pass the years of experience for the training set x_train, the predicted salaries for the training set x_pred, and the color of the line.
Next, we give the plot a title using the title() function of the pyplot library, passing the name "Salary vs Experience (Training Dataset)". After that, we assign labels to the x-axis and y-axis using the xlabel() and ylabel() functions.
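A sketch of the visualization step (green for observations per the text; red for the line is an arbitrary choice):

# Visualize the training-set results.
plt.scatter(x_train, y_train, color='green')   # actual observations
plt.plot(x_train, x_pred, color='red')         # fitted regression line
plt.title('Salary vs Experience (Training Dataset)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()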
Output:
Executing the lines above produces a plot of the green training observations with the fitted regression line running through them.
# Extract the dependent and independent variables from the given dataset.
# The independent variable is years of experience, and the dependent
# variable is salary.
x = data_set.iloc[:, :-1].values
y = data_set.iloc[:, 1].values
Multiple Linear Regression
The multiple regression equation takes the following form:
y = b0 + b1x1 + b2x2 + … + bkxk
Here, the bi's (i = 1, 2, …, k) are the regression coefficients, which represent the change in the criterion (dependent) variable when the corresponding predictor variable changes by one unit, x1, x2, …, xk are the k independent variables, and y is the dependent variable.
As an example, the test score of a student in an exam may depend on various factors such as their focus while attending class, their food intake before the exam, and the amount of sleep they get before the exam. Using multiple regression, one can estimate the relationship between the score and these factors.
Multiple regression is like linear regression, but with more than one independent variable, meaning that we try to predict a value based on two or more variables.
https://www.w3schools.com/python/python_ml_multiple_regression.asp
We can predict the CO2 emission of a car based on the size of its engine, but with multiple regression we can throw in more variables, such as the weight of the car, to make the prediction more accurate. The dataset has the columns Car, Model, Volume, Weight, and CO2.
PredictedCO2 = 107.208 g
We have predicted that a car with a 1.3 liter engine and a weight of 2300 kg will release approximately 107 grams of CO2 for every kilometer it drives.
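A sketch following the w3schools example (it assumes a file data.csv with columns Car, Model, Volume, Weight, CO2):

# Multiple regression: predict CO2 from both Weight and Volume.
import pandas
from sklearn import linear_model

df = pandas.read_csv("data.csv")
X = df[['Weight', 'Volume']]   # independent variables
y = df['CO2']                  # dependent variable

regr = linear_model.LinearRegression()
regr.fit(X, y)

# Predict the CO2 emission of a car weighing 2300 kg with a 1300 cm3 engine.
predictedCO2 = regr.predict([[2300, 1300]])
print(predictedCO2)   # approximately [107.2087328]
print(regr.coef_)     # coefficients for Weight and Volume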
We have already predicted that if a car with a 1300 cm3 engine weighs 2300 kg, the CO2 emission will be approximately 107 g. What if we increase the weight by 1000 kg?
In this case, we can ask for the coefficient value of weight against CO2, and for
volume against CO2. The answer(s) we get tells us what would happen if we
increase, or decrease, one of the independent values.
The result array represents the coefficient values of weight and volume.
Weight:0.00755095
Volume: 0.00780526
These values tell us that if the weight increases by 1 kg, the CO2 emission increases by 0.00755095 g.
And if the engine size (Volume) increases by 1 cm3, the CO2 emission increases
by 0.00780526 g.
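Using these coefficients, the effect of adding 1000 kg can be computed directly:
107.2087 + (1000 × 0.00755095) ≈ 114.76 g
so the same car weighing 3300 kg instead of 2300 kg would be predicted to release roughly 114.76 grams of CO2 per kilometer.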
K-Nearest Neighbor (K-NN) Algorithm
Suppose there are two categories, i.e., Category A and Category B, and we have a new data point x1; in which of these categories will the data point lie? To solve this type of problem, we need a K-NN algorithm: the new point is assigned to the category most common among its K nearest neighbors, so K-NN lets us easily identify the category or class of a new data point.
How to select the value of K in the K-NN algorithm:
o There is no particular way to determine the best value for "K", so we need to try some values to find the best among them. The most preferred value for K is 5.
o A very low value for K, such as K = 1 or K = 2, can be noisy and lead to the effects of outliers in the model.
o Large values for K are good, but a very large K may include points from other categories.
o It is recommended to choose an odd value for K to avoid ties in classification.
Advantages of K-NN:
o It is simple to implement.
o It is robust to noisy training data.
o It can be more effective if the training data is large.
Disadvantages of K-NN:
o It always needs a value of K to be determined, which may be complex at times.
o The computation cost is high because the distance between the new point and all the training samples must be calculated.
KNN Example
https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
Common distance metrics for finding the nearest neighbors:
• Euclidean Distance: d(x, y) = √( Σ (xi − yi)² )
• Manhattan Distance: d(x, y) = Σ |xi − yi|
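A minimal K-NN classification sketch (scikit-learn's Iris dataset stands in for the Category A / Category B example):

# Classify new points by the majority class of their K nearest neighbors.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# K = 5 neighbors; the default metric is Euclidean distance.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("Test accuracy:", knn.score(X_test, y_test))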
Types of Cross-validation
1. K-Fold Cross-Validation: The data is divided into K subsets (or “folds”). The model is
trained K times, using K-1 folds for training and one-fold for testing in each iteration.
2. Leave-One-Out Cross-Validation (LOOCV): K-Fold CV with K equal to the number of
data points, i.e., each data point is used once as a test set, and the model is trained K times.
3. Stratified K-Fold Cross-Validation: It ensures that the class distribution remains similar
in each fold, important when dealing with imbalanced datasets.
4. Time Series Cross-Validation: For time-dependent data, it uses a series of temporally
ordered training and testing sets, preventing the use of future data for training.
5. Shuffle-Split Cross-Validation: Randomly shuffles the data, and then splits it into
training and testing sets multiple times.
6. Group K-Fold Cross-Validation: Useful when the data contains groups, such as multiple samples from the same subject; it ensures that all samples from one group fall entirely within either the training set or the test set, never both.
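A short K-Fold cross-validation sketch (the dataset and model are illustrative choices):

# 5-fold cross-validation: train on 4 folds, test on the remaining fold, 5 times.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, KFold

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)
print("Fold scores:", scores)
print("Mean accuracy:", scores.mean())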
Bias and Variance
A model with high variance may represent the data set accurately but risks overfitting to noisy or otherwise unrepresentative training data. In comparison, a model with high bias may underfit the training data because it is too simple and overlooks regularities in the data. Bias creates consistent errors in the ML model, reflecting a model that is too simple for the given requirement. Variance, on the other hand, creates errors of inconsistency: the model makes incorrect predictions by seeing trends or data points that do not exist.
What is Overfitting?
Overfitting is an undesirable machine learning behavior that occurs when the machine learning
model gives accurate predictions for training data but not for new data. When data
scientists use machine learning models for making predictions, they first train the model on a
known data set. Then, based on this information, the model tries to predict outcomes for new
data sets. An overfit model can give inaccurate predictions and cannot perform well for all
types of new data.
Overfitting can occur for several reasons:
• The training data size is too small and does not contain enough data samples to accurately represent all possible input data values.
• The training data contains large amounts of irrelevant information, called noisy data.
• The model trains for too long on a single sample set of data.
• The model complexity is high, so it learns the noise within the training data.
What is Underfitting?
Underfitting is another type of error that occurs when the model cannot determine a meaningful relationship between the input and output data. Models underfit when they have not been trained for long enough, or on enough data points, to capture the underlying pattern.
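Both behaviors can be seen in a small sketch that fits polynomial models of increasing degree to noisy data (all values are illustrative):

# Degree 1 underfits (too simple); degree 15 tends to overfit (learns the noise).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 40))[:, None]
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 40)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree, "train R2:", round(model.score(X_train, y_train), 3),
          "test R2:", round(model.score(X_test, y_test), 3))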
Regression Line: If our data shows a linear relationship between X and Y, then the straight line which best describes that relationship is the regression line: the straight line that lies as close as possible to the points in the graph. A straight line has the equation y = mx + b, where:
• y = how far up
• x = how far along
• m = Slope or Gradient (how steep the line is)
• b = the Y Intercept (where the line crosses the Y axis)
The sample standard deviation of the data is
s = √( Σ (xi − x̄)² / (n − 1) )
where xi stands for the data values, x̄ (x bar) is the mean value, and n is the sample size.
The standard error of estimate measures the accuracy of the predictions made by a
regression model. In other words, it determines how well the regression line describes the
values of a data set. If you have a collection of data from an experiment, survey, or other
source, follow along with us below to learn how to calculate your data set’s standard error of
estimate.
Example
Consider the following data pairs: (1, 2), (2, 4), (3, 5), (4, 4), (5, 5)
Solution: Calculate the regression line (as seen in the previous example) and populate the table.
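Working the example through (the fitted line for these pairs is y′ = 0.6x + 2.2, and the standard error of estimate is √(Σ(y − y′)² / (n − 2))):

x | y | y′ = 0.6x + 2.2 | y − y′ | (y − y′)²
1 | 2 | 2.8 | −0.8 | 0.64
2 | 4 | 3.4 | 0.6 | 0.36
3 | 5 | 4.0 | 1.0 | 1.00
4 | 4 | 4.6 | −0.6 | 0.36
5 | 5 | 5.2 | −0.2 | 0.04

Σ(y − y′)² = 2.40, so the standard error of estimate is √(2.40 / (5 − 2)) = √0.8 ≈ 0.894.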
R-squared method:
Consider a graph showing how the number of lectures per day affects the number of hours spent at university per day, with the equation of the regression line drawn on the graph.
Solution:
For the point (2, 2), substitute the x value into the regression equation to compute the predicted Y. The actual value of Y at (2, 2) is 2, and the residual is the difference between the actual and predicted values of Y.
Repeating this for every observation, compute the residual sum of squares SSres = Σ (y − ŷ)² and the total sum of squares SStot = Σ (y − ȳ)². The goodness of fit is then
R² = 1 − SSres / SStot
Regression to the Mean
Regression to the mean refers to the idea that rare or extreme events are likely to be followed by more typical ones. Over time, outcomes regress to the average or mean.
Consider two students, Jane and Joe. In year one, Jane does horribly but Joe is outstanding: Jane is ranked in the bottom 1 percent, while Joe is ranked at the 99th percentile. If their results were entirely due to talent, there would be no regression: Jane should be as bad in year two, and Joe should be as good (possibility A). If their results were equal parts luck and talent, we would expect halfway regression: Jane should rise to around the 25th percentile and Joe should fall to around the 75th (possibility B). If their results were caused entirely by luck (e.g. flipping a coin), then in year two we would expect both Jane and Joe to regress all the way back to the 50th percentile (possibility C).