Regression PDF
Learning Outcomes: (Student to write briefly about learnings obtained from the academic tasks)
To understand the concepts of machine learning.
Declaration:
I declare that this Assignment is my individual work. I have not copied it from any other
student's work or from any other source except where due acknowledgement is made explicitly
in the text, nor has any part been written for me by any other person.
Student’s
Signature: Vishal Jaiswal
Evaluator's comments (For Instructor's use only)
data=school_grades_dataset
str(data)                                   # inspect variable types
abc=as.numeric(data$failures)               # dependent variable: number of failures
# stepwise (AIC-based) selection over the candidate predictors
hh=step(lm(abc~sex+age+reason+internet,data=data))
summary(hh)
pred=round(predict(hh),1)                   # fitted values, rounded to 1 decimal
d=data.frame(pred,abc)                      # predicted vs actual side by side
plot(abc,type="l",col="red")                # actual values
lines(pred,col="blue")                      # overlay predictions
INTERPRETATION
The data I have taken shows how various factors affect the school grades of a
particular student.
Taking failures as the dependent variable and regressing it on the major independent
factors sex, age, reason, and internet, the results are obtained.
The significance codes ('***', three stars) show that the selected predictors are statistically significant.
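The stepwise selection used above can be reproduced on R's built-in mtcars data (a hypothetical stand-in, since the school-grades dataset is not included here):

```r
# Sketch of AIC-based stepwise selection on the built-in mtcars data
full <- lm(mpg ~ wt + hp + cyl + gear, data = mtcars)
best <- step(full, trace = 0)   # backward elimination by AIC
names(coef(best))               # predictors that survived selection
```

step() repeatedly drops (or adds) the term whose removal most improves the AIC, stopping when no single change helps.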
CLUSTERING
A cluster is a group of objects that belong to the same class. In other words, similar objects are
grouped in one cluster and dissimilar objects are grouped in another cluster.
Clustering is an unsupervised machine learning method that attempts to uncover the natural
groupings and statistical distributions of data. There are multiple clustering methods, such as
k-means and hierarchical clustering. Often a measure of distance from point to point is used
to decide which category a point should belong to, as with k-means. Hierarchical clustering
builds up or breaks down sets of clusters based on the input data, allowing the user to pick
the set of clusters that best accomplishes their purpose. The algorithm will not name the
groups it creates for you, but it will show you where they are, and they can then be named
anything.
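As a small illustration of the hierarchical approach (on hypothetical toy points, not the wine data used below), base R's hclust builds the dendrogram and cutree extracts the groups:

```r
# Two obvious groups of 2D points
pts <- data.frame(x = c(1, 1.2, 5, 5.1), y = c(1, 0.9, 5, 5.2))
d <- dist(pts)                        # pairwise Euclidean distances
hc <- hclust(d, method = "complete")  # agglomerative (bottom-up) clustering
groups <- cutree(hc, k = 2)           # cut the tree into 2 clusters
groups                                # points 1-2 in one group, 3-4 in the other
```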
wine_quality
head(wine_quality)
plot(wine_quality$Quality~wine_quality$pH)   # quality against pH
s1=wine_quality[,-1]                         # drop the first column, keep the features
head(s1)
set.seed(123)                                # k-means starting centres are random
results.s1=kmeans(s1,centers = 5)            # partition into 5 clusters
results.s1
results.s1$cluster                           # cluster assignment per wine
results.s1$size                              # number of wines per cluster
INTERPRETATION
The data is about wine quality, which depends on many factors. The quality has been
rated from a minimum of 3 to a maximum of 9.
The classification of observations into groups requires some methods for computing the
distance or the (dis)similarity between each pair of observations. The result of this computation
is known as a dissimilarity or distance matrix.
In the case of classification, a new data point is assigned to a class on the basis of a majority vote
among its nearest neighbours.
In the case of regression, a new data point is labelled with the average of its nearest neighbours' values.
It is a lazy learner because it does little work at training time; computation is deferred to prediction.
The default metric is the Euclidean distance (the shortest distance between two points,
√((x1−x2)² + (y1−y2)²)), used for continuous variables, whereas for
discrete variables, such as in text classification, an overlap metric (Hamming distance) would
be employed.
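Both metrics are easy to sketch directly in R (toy vectors, just to show the formulas):

```r
euclid  <- function(a, b) sqrt(sum((a - b)^2))  # continuous variables
hamming <- function(a, b) sum(a != b)           # discrete variables (overlap)
euclid(c(0, 0), c(3, 4))                        # 5, the classic 3-4-5 triangle
hamming(c("a", "b", "c"), c("a", "x", "c"))     # 1 mismatching position
```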
Dataset: Credit cards
library(caret)                               # createDataPartition, train, confusionMatrix
data=CREDIT_DATASET
View(CREDIT_DATASET)
CREDIT_DATASET$class=as.factor(CREDIT_DATASET$class)   # target must be a factor
str(CREDIT_DATASET)
set.seed(123)                                # reproducible split
index1=createDataPartition(CREDIT_DATASET$class,p=0.8, list=FALSE)
traindata1=CREDIT_DATASET[index1,]           # 80% training set
testdata1=CREDIT_DATASET[-index1,]           # 20% test set
modelknn=train(class~.,method="knn",data=traindata1)   # caret tunes k internally
modelknn
prediction1=predict(modelknn,testdata1)
conmatrix=confusionMatrix(prediction1,testdata1$class)
conmatrix
INTERPRETATION
K-NN (k-nearest neighbours) is also a supervised machine learning model. It can be used for both
classification and regression. In both cases the input consists of the k closest training examples
in the feature space. On the credit data above it has given an accuracy of 69.5%, as indicated below.
The algorithm is highly unbiased in nature and makes no prior assumption about the underlying
data. We found the frequency of the classes among the data. We split the data
into test data and train data so that we can run the algorithm to find the desired results: at
first we collected the data, then created test and train data for a better understanding of the
model on the data set.
Confusion Matrix and Statistics

          Reference
Prediction bad good
      bad   12   13
      good  48  127

               Accuracy : 0.695
                 95% CI : (0.6261, 0.758)
    No Information Rate : 0.7
    P-Value [Acc > NIR] : 0.5953
                  Kappa : 0.1286
 Mcnemar's Test P-Value : 1.341e-05
            Sensitivity : 0.2000
            Specificity : 0.9071
         Pos Pred Value : 0.4800
         Neg Pred Value : 0.7257
             Prevalence : 0.3000
         Detection Rate : 0.0600
   Detection Prevalence : 0.1250
      Balanced Accuracy : 0.5536
       'Positive' Class : bad
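The headline statistics can be recomputed by hand from the counts in the matrix above ('bad' is the positive class):

```r
# Rebuild the 2x2 table and derive the statistics from first principles
tab <- matrix(c(12, 48, 13, 127), nrow = 2,
              dimnames = list(Prediction = c("bad", "good"),
                              Reference  = c("bad", "good")))
accuracy    <- sum(diag(tab)) / sum(tab)                 # (12+127)/200 = 0.695
sensitivity <- tab["bad", "bad"]   / sum(tab[, "bad"])   # 12/60  = 0.200
specificity <- tab["good", "good"] / sum(tab[, "good"])  # 127/140 = 0.9071
c(accuracy, sensitivity, specificity)
```

These match the caret output, confirming how each figure is read off the table.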
2. It can predict values of one variable from values of another, based on the historical relationship
between the independent and dependent variables.
Data used: Data from Kaggle for analysing a YouTube channel's rating based on the
number of views and subscribers. This will help us find the number of people subscribing
after watching a video, i.e. how the number of views leads to subscriptions. The link is given
at the end in the references.
Code:
library(caret)
reg
model=lm(Video.views~Subscribers,data=reg)   # simple linear regression
model
### predict from the fitted model
predict(model)
options(scipen=999)                          # turn off scientific notation
summary(model)
### round the predictions
round(predict(model),0)
pred=round(predict(model),1)
class(pred)
plot(reg$Video.views,type = "l",col="blue")  # actual views
lines(pred,col="green")                      # overlay predicted views
INTERPRETATION
The data I have taken shows how ranks are given to the different channels based
upon their views, number of videos uploaded, and subscribers. The first step in interpreting the
regression analysis is to examine the F-statistic and the associated p-value at the
bottom of the model summary. In our example, the F-statistic is 10.64 and its p-value is
highly significant. This means that at least one of the predictor variables is
significantly related to the outcome variable. When the adjusted R-squared falls well below the multiple
R-squared, there is a case of overfitting. In my example they are nearly at the same
level, which shows there is no overfitting problem; this also indicates that these variables
are required. The significance codes ('***', three stars) show that the predictors are significant.
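The multiple vs adjusted R-squared comparison can be checked on simulated data (a hypothetical example, since the YouTube data is not reproduced here). Adjusted R-squared is always at most the multiple R-squared, and a large gap between them flags overfitting:

```r
set.seed(1)
toy <- data.frame(x = 1:30, noise = rnorm(30))
toy$y <- 2 * toy$x + rnorm(30)         # y depends on x only
fit <- lm(y ~ x + noise, data = toy)   # 'noise' is an irrelevant predictor
s <- summary(fit)
c(multiple = s$r.squared, adjusted = s$adj.r.squared)  # nearly equal here
```

The adjusted figure penalises each extra predictor, so stuffing the model with useless variables pulls it down while the multiple R-squared can only rise.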
REFERENCES:
Kaggle.com/datasets
Uci.com/datasets
Mldata.io