o Weak AI
o Generative AI
o Strong AI:- The future of AI is Strong AI, which is said to become more intelligent than humans.
Machine Learning Tutorial
1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning
Supervised learning can be grouped further into two categories of algorithms:
o Classification
o Regression
Unsupervised learning can be grouped further into two categories of algorithms:
o Clustering
o Association
The machine learning life cycle involves seven major steps, which are given
below:
o Gathering Data
o Data preparation
o Data Wrangling
o Analyse Data
o Train the model
o Test the model
o Deployment
Gathering Data
Data Preparation
o Data exploration:
o Data pre-processing:
Data wrangling is the process of cleaning and converting raw data into a usable
format. It is the process of cleaning the data, selecting the variables to use, and
transforming the data into a proper format to make it more suitable for analysis in
the next step.
It is mandatory to detect and remove these issues because they can negatively
affect the quality of the outcome.
Now the cleaned and prepared data is passed on to the analysis step.
The aim of this step is to build a machine learning model to analyse the data
using various analytical techniques and review the outcome. It starts with
determining the type of problem, where we select machine learning
techniques such as Classification, Regression, Cluster
analysis, Association, etc.; then we build the model using the prepared data and
evaluate it.
Now the next step is to train the model. In this step, we train our model to
improve its performance and obtain a better outcome for the problem. Training a model is
required so that it can understand the various patterns, rules, and features in the data.
Once our machine learning model has been trained on a given dataset, we
test the model. In this step, we check the accuracy of our model by providing
a test dataset to it.
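A minimal sketch of this testing step (assuming a fitted scikit-learn classifier named model and a held-out test set x_test, y_test; all names are hypothetical at this point):
from sklearn.metrics import accuracy_score
y_pred = model.predict(x_test)                        #predictions on unseen test data
print('Test accuracy:', accuracy_score(y_test, y_pred))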
Types of datasets
Examples:
Tabular Datasets:
o Tabular datasets are structured data organized in tables or
spreadsheets. They contain rows representing instances or samples
and columns representing features or attributes. Tabular datasets
are used for tasks like regression and classification. The
dataset given earlier in the article is an example of a tabular
dataset.
o Training dataset:
o Test Dataset
data_set= pd.read_csv('Dataset.csv')
In our dataset, there are three independent variables that are Country, Age,
and Salary, and one is a dependent variable which is Purchased.
x= data_set.iloc[:, :-1].values
In the above code, the first colon(:) is used to take all the rows, and the second
colon(:) is for all the columns. Here we have used :-1, because we don't want to
take the last column as it contains the dependent variable. So by doing this, we
will get the matrix of features.
y= data_set.iloc[:, 3].values
Ways to handle missing data:
There are mainly two ways to handle missing data: by deleting the rows that contain it, or by replacing it with a computed value such as the column mean. The code below uses the second approach:
#handling missing data (Replacing missing data with the mean value)
from sklearn.preprocessing import Imputer
imputer= Imputer(missing_values ='NaN', strategy='mean', axis = 0)
#Fitting imputer object to the independent variables x.
imputer= imputer.fit(x[:, 1:3])
#Replacing missing data with the calculated mean value
x[:, 1:3]= imputer.transform(x[:, 1:3])
As we can see in the above output, the missing values have been replaced with
the mean of the remaining column values.
#Categorical data
#for Country Variable
from sklearn.preprocessing import LabelEncoder
label_encoder_x= LabelEncoder()
x[:, 0]= label_encoder_x.fit_transform(x[:, 0])
Explanation:
Dummy Variables:
labelencoder_y= LabelEncoder()
y= labelencoder_y.fit_transform(y)
If we train our model very well so that its training accuracy is very high, but then
provide a new dataset to it, its performance will decrease. So we always
try to make a machine learning model which performs well with the training set
and also with the test dataset. Here, we can define these datasets as follows: the
training dataset is the subset used to fit the model, and the test dataset is the
subset used to evaluate it.
For splitting the dataset, we will use the below lines of code:
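(These are the same two lines used in the consolidated preprocessing code later in this section.)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)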
Explanation:
o In the above code, the first line is used for splitting arrays of the
dataset into random train and test subsets.
o In the second line, we have used four variables for our output, which are:
o x_train: features for the training data
o x_test: features for the testing data
o y_train: dependent variable for the training data
o y_test: dependent variable for the testing data
o In the train_test_split() function, we have passed four parameters, of
which the first two are the arrays of data, and test_size specifies
the size of the test set. The test_size may be .5, .3, or .2, which tells
the dividing ratio of the training and testing sets.
o The last parameter random_state is used to set a seed for the
random generator so that you always get the same result; the
most commonly used value for it is 42.
7) Feature Scaling
o Feature scaling is the final step of data preprocessing in machine
learning. It is a technique to standardize the independent variables
of the dataset within a specific range. In feature scaling, we put our
variables on the same scale so that no single variable dominates
the others.
o As we can see, the age and salary column values are not on the
same scale. Many machine learning models are based on Euclidean
distance, and if we do not scale the variables, it will cause
issues in our machine learning model.
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
Now, in the end, we can combine all the steps together to make our
complete code more understandable.
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('Dataset.csv')

#Extracting Independent Variable
x= data_set.iloc[:, :-1].values

#Extracting Dependent variable
y= data_set.iloc[:, 3].values

#handling missing data (Replacing missing data with the mean value)
from sklearn.preprocessing import Imputer
imputer= Imputer(missing_values ='NaN', strategy='mean', axis = 0)

#Fitting imputer object to the independent variables x.
imputer= imputer.fit(x[:, 1:3])

#Replacing missing data with the calculated mean value
x[:, 1:3]= imputer.transform(x[:, 1:3])

#Encoding categorical data: the Country variable
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
label_encoder_x= LabelEncoder()
x[:, 0]= label_encoder_x.fit_transform(x[:, 0])

#Encoding for dummy variables
onehot_encoder= OneHotEncoder(categorical_features= [0])
x= onehot_encoder.fit_transform(x).toarray()

#Encoding for the Purchased variable
labelencoder_y= LabelEncoder()
y= labelencoder_y.fit_transform(y)

#Splitting the dataset into training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)

#Feature Scaling of datasets
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
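Note that Imputer and the categorical_features argument of OneHotEncoder have been removed from recent scikit-learn releases. A minimal sketch of the same preprocessing for scikit-learn 0.22 or newer, assuming the same Dataset.csv layout, could look like this:
import numpy as nm
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split

data_set = pd.read_csv('Dataset.csv')
x = data_set.iloc[:, :-1].values
y = data_set.iloc[:, 3].values

#Replace missing Age/Salary values (columns 1 and 2) with the column mean
imputer = SimpleImputer(missing_values=nm.nan, strategy='mean')
x[:, 1:3] = imputer.fit_transform(x[:, 1:3])

#One-hot encode the Country column (column 0) and pass the other columns through
ct = ColumnTransformer([('country', OneHotEncoder(), [0])], remainder='passthrough', sparse_threshold=0)
x = ct.fit_transform(x)

#Encode the Purchased labels
y = LabelEncoder().fit_transform(y)

#Split and scale as before
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)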
Supervised Learning Dataset
1. Regression
o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression
2. Classification (e.g., Spam Filtering)
o Random Forest
o Decision Trees
o Logistic Regression
o Support Vector Machines
Unsupervised learning is a type of machine learning in which models are trained using
an unlabeled dataset and are allowed to act on that data without any supervision.
Types of regression
o Linear Regression
o Logistic Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
o Ridge Regression
o Lasso Regression
Linear regression is a straightforward statistical method used for predictive analysis,
modeling the linear relationship between continuous variables. It can be simple (with one
input variable) or multiple (with more than one input variable). The model helps predict
values, such as an employee's salary based on years of experience, by showing how
changes in independent variables affect the dependent variable.
Y = aX + b
Based on the categories of the target variable, classification problems can be:
o Binary (0/1, pass/fail)
o Multi-class (cats, dogs, lions)
o Ordinal (low, medium, high)
o Polynomial Regression is a type of regression which models
the non-linear dataset using a linear model.
o It is similar to multiple linear regression, but it fits a non-linear curve
between the value of x and the corresponding conditional values of y.
Note: This is different from Multiple Linear regression in such a way that in Polynomial
regression, a single element has different degrees instead of multiple variables with the
same degree.
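A minimal sketch of this idea (the toy arrays and degree=2 are arbitrary choices for illustration):
import numpy as nm
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

x = nm.array([[1], [2], [3], [4], [5]])       #single feature
y = nm.array([1, 4, 9, 16, 25])               #clearly non-linear target

poly = PolynomialFeatures(degree=2)           #adds x^0, x^1, x^2 columns for the single feature
x_poly = poly.fit_transform(x)

model = LinearRegression().fit(x_poly, y)     #still a linear model, fitted on polynomial features
print(model.predict(poly.transform([[6]])))   #approximately 36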
Support Vector Regression (SVR) is a regression algorithm which works for
continuous variables. Keywords used in SVR include the kernel, the hyperplane,
the boundary lines, and the support vectors.
In the typical SVR plot, the middle line is called the hyperplane, and the two
lines on either side of it are known as boundary lines.
Decision Tree Regression is a supervised learning algorithm used for both classification and
regression problems, handling categorical and numerical data. It constructs a tree-like
structure where each internal node tests an attribute, branches represent outcomes, and leaf
nodes provide the final prediction. The tree starts from a root node and recursively splits into
child nodes until reaching a decision.
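A minimal sketch of decision tree regression (the toy data is hypothetical):
import numpy as nm
from sklearn.tree import DecisionTreeRegressor

x = nm.array([[1], [2], [3], [4], [5], [6]])
y = nm.array([1.2, 1.9, 3.1, 3.9, 5.2, 6.1])

regressor = DecisionTreeRegressor(random_state=0)   #splits are chosen to reduce prediction error
regressor.fit(x, y)
print(regressor.predict([[3.5]]))                   #prediction comes from the matching leaf node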
o Random forest is one of the most powerful supervised learning
algorithms which is capable of performing regression as well as
classification tasks.
o The Random Forest regression is an ensemble learning method
which combines multiple decision trees and predicts the final output
based on the average of each tree's output. The combined decision
trees are called base models, and the model can be represented more
formally as:
o g(x)= f0(x)+ f1(x)+ f2(x)+....
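A minimal sketch of random forest regression (toy data and n_estimators=10 are arbitrary choices):
import numpy as nm
from sklearn.ensemble import RandomForestRegressor

x = nm.array([[1], [2], [3], [4], [5], [6]])
y = nm.array([1.1, 2.0, 2.9, 4.2, 5.1, 5.8])

regressor = RandomForestRegressor(n_estimators=10, random_state=0)   #10 base decision trees
regressor.fit(x, y)
print(regressor.predict([[3.5]]))    #final output is the average of the individual tree outputs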
Ridge Regression:
Lasso Regression:
y = a0 + a1x + ε
Here,
a0 = intercept of the line
a1 = linear regression coefficient (slope of the line)
ε = random error
The values of the x and y variables are the training data used for the Linear Regression model
representation.
A linear line showing the relationship between the dependent and independent
variables is called a regression line. A regression line can show two types of
relationship:
Different values for the weights or coefficients of the line (a0, a1) give a different
regression line, so we need to calculate the best values for a0 and a1 to find
the best-fit line. To calculate this, we use the cost function.
o We can use the cost function to find the accuracy of the mapping
function, which maps the input variable to the output variable. This
mapping function is also known as Hypothesis function.
For Linear Regression, we use the Mean Squared Error (MSE) cost
function, which is the average of the squared errors between the
predicted values and the actual values. For the above
linear equation, MSE can be calculated as:
MSE = (1/N) Σ (Yi − (a0 + a1xi))²
Where,
N = total number of observations
Yi = actual value
a0 + a1xi = predicted value
Residuals: The distance between the actual value and the predicted value is called the
residual. If the observed points are far from the regression line, then the residuals
will be high, and so the cost function will be high. If the scatter points are close to the
regression line, then the residuals will be small, and hence so will the cost function.
Gradient Descent:
1. R-squared method:
Homoscedasticity Assumption:
Homoscedasticity is a situation when the error term is the same for all the
values of independent variables. With homoscedasticity, there should be
no clear pattern distribution of data in the scatter plot.
Linear regression assumes that the error term should follow the normal
distribution pattern. If error terms are not normally distributed, then
confidence intervals will become either too wide or too narrow, which may
cause difficulties in finding coefficients.
It can be checked using a q-q plot. If the plot shows a straight line
without any deviation, it means the error is normally distributed.
No autocorrelations:
a1= It is the slope of the regression line, which tells whether the line is
increasing or decreasing.
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
data_set= pd.read_csv('Salary_Data.csv')
x= data_set.iloc[:, :-1].values
y= data_set.iloc[:, 1].values
Now the second step is to fit our model to the training dataset. To do so,
we will import the LinearRegression class of the linear_model library
from scikit-learn. After importing the class, we are going to create an
object of the class named regressor. The code for this is given
below:
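(A sketch assuming x_train and y_train were produced by a train_test_split on the Salary_Data set, as in the preprocessing steps above.)
#Fitting the Simple Linear Regression model to the training dataset
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)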
Now in this step, we will visualize the training set result. To do so, we will
use the scatter() function of the pyplot library, which we have already
imported in the pre-processing step. The scatter () function will create a
scatter plot of observations.
After that, we will assign labels for the x-axis and y-axis using the xlabel() and
ylabel() functions.
Finally, we will represent all above things in a graph using show(). The
code is given below:
Here we are also changing the color of observations and regression line to
differentiate between the two plots, but it is optional.
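(A sketch; regressor is assumed to be the model fitted above, and the label texts are placeholders.)
#Visualizing the Training set results
y_pred_train = regressor.predict(x_train)        #predicted salaries for the training set
mtp.scatter(x_train, y_train, color="green")     #actual observations
mtp.plot(x_train, y_pred_train, color="red")     #regression line
mtp.title("Salary vs Experience (Training set)")
mtp.xlabel("Years of Experience")
mtp.ylabel("Salary")
mtp.show()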
1. All-in
2. Backward Elimination
3. Forward Selection
4. Bidirectional Elimination
5. Score Comparison
#importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
#importing datasets
data_set= pd.read_csv('50_CompList.csv')
#Extracting Independent and dependent Variable
x= data_set.iloc[:, :-1].values
y= data_set.iloc[:, 4].values
#Categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_x= LabelEncoder()
x[:, 3]= labelencoder_x.fit_transform(x[:,3])
onehotencoder= OneHotEncoder(categorical_features= [3])
x= onehotencoder.fit_transform(x).toarray()
#Avoiding the dummy variable trap:
x = x[:, 1:]
#Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)
#Fitting the MLR model to the training set:
from sklearn.linear_model import LinearRegression
regressor= LinearRegression()
regressor.fit(x_train, y_train)
#Predicting the Test set result:
y_pred= regressor.predict(x_test)
#Checking the score
print('Train Score: ', regressor.score(x_train, y_train))
print('Test Score: ', regressor.score(x_test, y_test))
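Backward Elimination, listed above as one way to select features for a multiple-regression model, could be sketched like this (a hedged illustration using statsmodels; x and y are the arrays prepared above, and the 0.05 significance level is an arbitrary choice):
import numpy as nm
import statsmodels.api as sm

#Backward Elimination: repeatedly drop the least significant feature
x_opt = sm.add_constant(x.astype(float))      #add the intercept column
while True:
    model_OLS = sm.OLS(y, x_opt).fit()        #ordinary least squares fit on current features
    if model_OLS.pvalues.max() <= 0.05:       #stop when all remaining features are significant
        break
    worst = model_OLS.pvalues.argmax()        #index of the least significant feature
    x_opt = nm.delete(x_opt, worst, axis=1)   #remove that column and refit
print(model_OLS.summary())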
1. Log Loss: −(y log(p) + (1 − y) log(1 − p))
2. Confusion Matrix:
3. AUC-ROC curve:
o ROC curve stands for Receiver Operating Characteristics
Curve and AUC stands for Area Under the Curve.
o It is a graph that shows the performance of the classification model
at different thresholds.
o To visualize the performance of the multi-class classification model,
we use the AUC-ROC Curve.
o The ROC curve is plotted with TPR and FPR, where TPR (True
Positive Rate) is on the Y-axis and FPR (False Positive Rate) is on the X-axis.
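A minimal sketch of computing and plotting the ROC curve for the binary case (classifier, x_test, and y_test are hypothetical names for a fitted classifier with predict_proba and a held-out test set):
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as mtp

y_prob = classifier.predict_proba(x_test)[:, 1]   #probability of the positive class
fpr, tpr, thresholds = roc_curve(y_test, y_prob)  #FPR and TPR at different thresholds
print('AUC:', roc_auc_score(y_test, y_prob))

mtp.plot(fpr, tpr, label='ROC curve')
mtp.plot([0, 1], [0, 1], linestyle='--', label='Random classifier')
mtp.xlabel('False Positive Rate')
mtp.ylabel('True Positive Rate')
mtp.legend()
mtp.show()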
o There is no particular way to determine the best value for "K", so we need
to try some values to find the best out of them. The most preferred value
for K is 5.
o A very low value for K such as K=1 or K=2, can be noisy and lead to the
effects of outliers in the model.
o Large values for K are generally good, but they can blur the class boundaries and make the algorithm slower.
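One simple way to choose K is to compare test accuracy over a few candidate values. A minimal sketch (x_train, x_test, y_train, y_test are assumed to come from a preprocessing step like the one above):
from sklearn.neighbors import KNeighborsClassifier

for k in [1, 3, 5, 7, 9]:                          #candidate K values to compare
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(x_train, y_train)
    print('K =', k, 'test accuracy =', knn.score(x_test, y_test))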
Types of SVM
SVM can be of two types:
o Linear SVM: Linear SVM is used for linearly separable data, which
means that if a dataset can be classified into two classes by using a
single straight line, then such data is termed linearly separable
data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-Linear SVM is used for non-linearly separable
data, which means that if a dataset cannot be classified by using a
straight line, then such data is termed non-linear data, and the
classifier used is called a Non-linear SVM classifier.
Hence, the SVM algorithm helps to find the best line or decision boundary; this
best boundary or region is called a hyperplane. The SVM algorithm finds the
closest points of the lines from both classes. These points are called support
vectors. The distance between the vectors and the hyperplane is called
the margin, and the goal of SVM is to maximize this margin.
The hyperplane with the maximum margin is called the optimal hyperplane.
Non-Linear SVM:
So to separate these data points, we need to add one more dimension. For
linear data, we have used two dimensions x and y, so for non-linear data,
we will add a third dimension z. It can be calculated as:
z = x² + y²
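A minimal sketch comparing the two types of SVM classifier (x_train, x_test, y_train, y_test are assumed; the RBF kernel is used here as one common way to realize the higher-dimensional mapping implicitly):
from sklearn.svm import SVC

linear_svm = SVC(kernel='linear')    #straight-line decision boundary with maximum margin
linear_svm.fit(x_train, y_train)

nonlinear_svm = SVC(kernel='rbf')    #kernel trick: implicit mapping to a higher dimension
nonlinear_svm.fit(x_train, y_train)

print('Linear SVM accuracy:    ', linear_svm.score(x_test, y_test))
print('Non-linear SVM accuracy:', nonlinear_svm.score(x_test, y_test))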
o Supervised Learning
o Semi-supervised Learning
o Unsupervised Learning
o Transduction
o Reinforcement Learning
o Decision Trees
o Probabilistic Networks
o Neural Networks
o Support Vector Machines
o Nearest Neighbor
On the other side, recall is the fraction of relevant instances that have
been retrieved out of the total number of relevant instances. Recall is
also known as sensitivity.
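A minimal sketch of computing these metrics with scikit-learn (y_test and y_pred are assumed to be the true and predicted labels of a binary classifier):
from sklearn.metrics import confusion_matrix, precision_score, recall_score

print('Confusion matrix:\n', confusion_matrix(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred))   #TP / (TP + FP)
print('Recall:   ', recall_score(y_test, y_pred))      #TP / (TP + FN), also called sensitivity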
Model accuracy is a subset of model performance. The accuracy of the model is directly
proportional to the performance of the model: the better the performance of the model,
the more accurate its predictions.
o In Bagging, the models are built independently, whereas Boosting tries to
add new models that perform well where previous models fail.
o Only Boosting determines weights for the data, to tip the scales in favor
of the most challenging cases.
o Only Boosting tries to reduce bias. On the other hand, Bagging may solve the
problem of over-fitting, while Boosting can increase it.
The functions factor() and as.factor() are used to convert variables into
factors.