
2016 Online International Conference on Green Engineering and Technologies (IC-GET)

A Comparative Analysis on Linear Regression and Support Vector Regression

Kavitha S, Varuna S, Ramya R
Assistant Professors, Computer Science and Engineering
Bannari Amman Institute of Technology, Sathyamangalam
[email protected]
Abstract— In business, consumers' interest, behavior and product profits are the insights required to predict the future of a business from current or historical data. These insights can be generated with statistical techniques for the purpose of forecasting. The statistical techniques can be evaluated for the predictive model based on the requirements of the data. Prediction and forecasting are widely done with time series data. Most applications, such as weather forecasting, finance and the stock market, combine historical data with current streaming data for better accuracy. The time series data is analyzed with regression models. In this paper, the linear regression and support vector regression models are compared using the training data set in order to select the correct model for better prediction and accuracy.

Index Terms—regression, linear regression, support vector, prediction, data analytics

I. INTRODUCTION

Analytics focuses on inference by statistical and mathematical analysis of data. The analysis helps to identify the problem from the collected data source. Solutions or other decisions can be provided with data analytics tools like Online Analytical Processing (OLAP), which use various tools and algorithms for better outcomes from data. Data analytics [1] can be descriptive, predictive or prescriptive. Descriptive analytics summarizes the data; predictive analytics helps in predicting future outcomes; prescriptive analytics combines prediction with a feedback system to track the outcomes produced by actions. Many technologies are used in data analytics, but predictive analytics is the one that uses machine learning algorithms and statistical analysis for future prediction, and it has emerged in business intelligence applications.

A. Business Intelligence

Business Intelligence [4] is the latest technology, comprising data analytics, that drives the predictive data market. The technology provides tools and other software solutions for gaining business insights to rule over the competitive business market. These applications help business people make better decisions. BI applications also help to identify the current trend of the market and to address business problems. These applications include cleansing data. The data sources for this analysis are obtained from dashboards, reviews, blogs and discussion forums. These current streaming data, historical data and consumer insights are combined to predict future events.

B. Machine Learning

Machine learning combines computer science and statistical analysis to enhance prediction. High-value predictions can be obtained within human-computer interaction, which helps to predict uncertain situations from the data. Machine learning algorithms are categorized as supervised and unsupervised. Supervised algorithms learn from labeled data and produce the result, whereas unsupervised algorithms do not use labeled data for learning; they are simply used to obtain inferences from the data source. Here, supervised learning algorithms of the regression model are used for analyzing time series data.

In this paper we analyze the linear and support vector regression models in order to select the correct model for prediction based on its requirements. This paper further consists of Related Work in Section II, Supervised Machine Learning in Section III, Regression Analysis in Section IV, Experimental Settings in Section V and Conclusion and Future Work in Section VI.

II. RELATED WORK

Data analytics has grown tremendously after the evolution of Big Data technologies, resulting in data mining tools for big data analytics [1], [2]. The major purpose of these analytics is prediction, which underpins Business Intelligence [3], [4]. Predictive analytics [5] rules the business market with BI applications. These analytics are performed with machine learning algorithms [6] over the data. Several machine learning algorithms are used for classification and regression. Mainly supervised learning models [7] are preferred because of their training data model. The supervised learning algorithms for linear regression [8], [9] have several regression functions. Linear regression with the LeastMedSq function is used to find the minimum of the squared medians, while support vector regression has several functions subject to linear and non-linear kernels. Support Vector Regression [10], [11], [12], [13] with a linear kernel can be used for linear regression. In SVR, the Sequential Minimal Optimization (SMO) regression [14] function is used along with the linear kernel function [15], [16]

978-1-5090-4556-3/16/$31.00 ©2016 IEEE


Authorized licensed use limited to: Polytechnic University of Bucharest. Downloaded on August 23,2023 at 10:54:32 UTC from IEEE Xplore. Restrictions apply.

for linear regression. The models are evaluated with metrics such as Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) to estimate their performance for particular business applications.

III. SUPERVISED MACHINE LEARNING

Supervised machine learning, represented in Fig. 1, infers a function from labeled training data. The training data is mapped to a new value from the input data to produce a result [7]. This method is fast and produces accurate results. The training dataset consists of a number of tuples (T); each tuple is a vector of attribute values. The target data can have more than one possible outcome or a continuous value. (T ∪ X) denotes T as the input attributes, containing n attributes, T = {a1, a2, a3, …, an}, with y as the target attribute. The attributes can be unordered set values or real numbers. The training sets are assumed to be generated randomly. The supervised learning algorithm can be used for classification and regression.

Fig. 1. Supervised Machine Learning

A. Classification

Classification is the process of predicting labels based on categories, represented in Fig. 2. For example, we can classify the marks of a student as pass or fail. For the classification process, a classifier is built and then used for further classification of data. Building the classifier is the learning process; it can be built using a classification algorithm from the training data set, where each tuple of the training data set carries a class label. Then the test data is used to evaluate the classification rules using the classifier. The training attributes can have nominal and numeric values, whereas the target attribute can have more than two possible outcomes.

Fig. 2. Classification

B. Regression

Regression is a statistical analysis method to identify the relationship between variables. The relationship can be identified between the dependent and independent variables. It can be described using probability distribution functions, as represented in Eq. 1.

Y = f(X, β)    (1)

where Y is the dependent variable, X is an independent variable and β is an unknown parameter.

Fig. 3. Regression

The variable dependency can be modeled with either univariate or multivariate regression. Univariate regression identifies the dependency on a single variable, as represented in Eq. 2.

y = a + bx + ε    (2)

where y is the dependent variable, x is an independent variable with coefficient b, and a is a constant, while multivariate regression [17] identifies the dependency among several variables simultaneously, as represented in Eq. 3.

y = a + b1x1 + b2x2 + … + bnxn + ε    (3)

In this paper, multivariate analysis is done with both regression models.

IV. REGRESSION MODELS

Regression models predict the outcome of the dependent variables from the independent variables. Significance is considered in regression analysis to handle the most complicated problems. In this paper, we discuss linear and support vector regression to find which best fits the predictive model.

A. Linear Regression

Linear Regression [8], [9] is the most common predictive model to identify the relationship among variables. Apart from univariate or multivariate data types, the concept is linear. Linear regression can be either simple linear or multiple linear regression. Linear regression is described in Eq. 4.

y = xβ + ε    (4)

In Eq. 4, y is the dependent variable, which can be either a continuous or categorical value, and x is an independent variable, which is always a continuous value. It is analyzed with a probability distribution, mainly focused on the conditional probability distribution in multivariate analysis.

1) Simple Linear Regression

Simple linear regression, represented in Fig. 4, is the process of prediction using a single independent variable, which is univariate regression analysis as described in Eq. 2. Simple linear regression distinguishes the dependent and independent variables to establish the relationship between the two variables, similar to correlation; correlation, however, does not distinguish between dependent and independent variables.

Fig. 4. Simple Linear Regression

2) Multiple Linear Regression

Multiple or multi-variable linear regression, represented in Fig. 5, is the process of prediction with more than one independent or predictor variable, which is similar to multivariate analysis as described in Eq. 3.

Fig. 5. Multiple Linear Regression

a) LeastMedSq Linear Regression

In this regression model, random samples are used to create the least median square functions. The model is evaluated with the value obtained for the median square, which should be minimal for the best fit.

B. Support Vector Regression

Support Vector Machine is one of the supervised learning models for classification and regression [12], [13]. A Support Vector Machine used for regression is specifically called Support Vector Regression. Support Vector Regression can be linear or non-linear using the respective kernel functions.

1) Linear Support Vector Regression

Support Vector Regression uses linear kernel functions for regression, similar to support vector machines, but SVR sets the tolerance margin (ε) for the approximation, unlike SVM where it is taken from the problem, as represented in Fig. 6.

Fig. 6. Support Vector Linear Regression

Support Vector Regression with a linear kernel function is described in Eq. 5.

y = w·x + b    (5)

where x is the input vector, w·x denotes the vector product and b is a constant. The model is obtained by minimizing the error function in Eq. 6, where the target is zᵢ.

min (1/2)||w||² + C Σᵢ₌₁ⁿ (ξᵢ + ξᵢ*)    (6)

subject to

zᵢ − (w·xᵢ + b) ≤ ε + ξᵢ
(w·xᵢ + b) − zᵢ ≤ ε + ξᵢ*
ξᵢ, ξᵢ* ≥ 0

a) SMOReg Linear Kernel

SMOReg implements Support Vector Regression with various kernels. In this paper we use SMOReg with the linear kernel function for the analysis of linear regression. The linear kernel function is described in Eq. 7.

K(x, y) = ⟨x, y⟩    (7)


2) Non-Linear Support Vector Regression

In non-linear Support Vector Regression, non-linear kernel functions are used for processing the training data in feature space, as represented in Fig. 7.

Fig. 7. Non-Linear Support Vector Regression

After processing the training data into feature space, normal support vector regression is applied using Eq. 8.

max [ −(1/2) Σᵢ,ⱼ₌₁ⁿ (αᵢ − αᵢ*)(αⱼ − αⱼ*) k(xᵢ, xⱼ) − ε Σᵢ₌₁ⁿ (αᵢ + αᵢ*) + Σᵢ₌₁ⁿ yᵢ (αᵢ − αᵢ*) ]    (8)

subject to

Σᵢ₌₁ⁿ (αᵢ − αᵢ*) = 0;    0 ≤ αᵢ, αᵢ* ≤ C

where αᵢ, αᵢ* are Lagrange multipliers.

V. EXPERIMENTAL SETTINGS

This section shows the experimental setup and results of the linear regression models evaluated using the training data set to find the best-fitting model for regression analysis of time series data.

A. Experimental Setup

In this paper, the experimental analysis was carried out using an existing data analysis tool called Weka [18], [19]. The dataset has been collected from the public UCI data repository [20]. The assessment data of students is taken for multivariate analysis. The dataset consists of 1000 instances, each with 5 attributes. The details of the dataset are represented in Table I. The evaluation is based on the training set; the dataset is classified automatically into train and test data sets.

TABLE I. DATA DEFINITION

Data Set                    Open University Learning Analytics Dataset
Data Set Characteristics    Multivariate, Sequential, Time Series
No. of Instances            1000
No. of Attributes           5
Attributes                  id_assessment, id_student, date_submitted, is_banked, score
Attribute Characteristics   Integer
Test Mode                   Evaluate on training set
Classifier Model            Full training set

B. Experimental Results

Table II and Table III represent the metrics obtained for the evaluation of the LeastMedSq function for the linear regression model and the SMOreg function, which uses SVM with a linear kernel, respectively.

TABLE II. LEASTMEDSQ FUNCTION (TRAINING SET)

Metrics                       Observed Value
Time taken to build model     3.29 seconds
Correlation coefficient       -0.0038
Mean absolute error           9.6689
Root mean squared error       12.54
Relative absolute error       98.8263%
Root relative squared error   100.9408%
Total number of instances     1000

The evaluation metrics in the above table cover only the linear regression model with the LeastMedSq function.

TABLE III. SMOREG FUNCTION (TRAINING SET)

Metrics                       Observed Value
Time taken to build model     2.42 seconds
Correlation coefficient       -0.0029
Mean absolute error           10.0784
Root mean squared error       12.8725
Relative absolute error       103.0124%
Root relative squared error   103.6172%
Total number of instances     1000

The evaluation metrics in the above table cover only the linear regression model with the SMOreg function with linear kernel, where C = 1.0.
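For readers without Weka, the experiment can be approximated as follows. This is a hypothetical scikit-learn analogue on synthetic data of the same shape (1000 instances, numeric predictors and a score target), using ordinary least squares as a stand-in for LeastMedSq (scikit-learn has no least-median-of-squares regressor) and a linear-kernel SVR with C = 1.0 in place of SMOReg, evaluated on the training set as in Tables II and III.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Synthetic stand-in for the 1000-instance assessment data; the actual
# Open University dataset is available from the UCI repository [20].
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))                                   # predictors
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(0, 1.0, 1000)  # score

for name, model in [
    ("LinearRegression", LinearRegression()),            # OLS stand-in for
                                                         # Weka's LeastMedSq
    ("SVR linear kernel", SVR(kernel="linear", C=1.0)),  # analogue of SMOReg
]:
    model.fit(X, y)
    pred = model.predict(X)            # "evaluate on training set" mode
    mae = mean_absolute_error(y, pred)
    rmse = np.sqrt(mean_squared_error(y, pred))
    print(f"{name}: MAE={mae:.4f} RMSE={rmse:.4f}")
```

The MAE and RMSE printed here correspond to the evaluation metrics of Eqs. 8-9 in the performance evaluation; on this synthetic data both models recover the linear relationship, unlike the near-zero correlations observed on the real assessment data.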


C. Performance Evaluation

The performance is evaluated with the metrics obtained. In this analysis we consider only two metrics, Mean Absolute Error and Root Mean Squared Error, for the evaluation.

1) Evaluation Metrics

a) Mean absolute error

MAE is the mean of the absolute errors over all predicted values, as described in Eq. 9.

MAE = (1/n) Σᵢ₌₁ⁿ |y'ᵢ − yᵢ|    (9)

b) Root mean squared error

RMSE is the square root of the mean of the squared errors, as described in Eq. 10.

RMSE = √( (1/n) Σᵢ₌₁ⁿ (y'ᵢ − yᵢ)² )    (10)

The above metrics are compared to identify the best model in terms of linear regression for multivariate data analysis. If the RMSE and MAE are minimal, then the model is the best fit for linear regression. The comparison is represented in Fig. 8.

Fig. 8. Comparison of LeastMedSq and SMOReg linear kernel functions

It is observed that the LeastMedSq function best fits linear regression, though the time taken to build the training model is longer compared to the SMOReg function with linear kernel.

VI. CONCLUSION AND FUTURE WORK

Data analytics and business intelligence play a major role in the current competitive market. When analyzing multivariate time series data, an efficient data model should be used for accurate results. If the linear regression model is used, then there are several functions associated with this model. In this paper we analyzed the linear regression model with the LeastMedSq function and the SMOreg function over a multivariate time series data set. The analytical results concluded that LeastMedSq is the best model for linear regression.

In future, a better model can be identified for linear regression with minimized RMSE and MAE and also minimum time taken for constructing the model on training data.

REFERENCES

[1] Philip Russom, "Big Data Analytics", TDWI Best Practices Report, 2011.
[2] Alfredo Cuzzocrea, Il-Yeol Song, Karen C. Davis, "Analytics over Large-Scale Multidimensional Data: The Big Data Revolution!", DOLAP'11, ACM, October 28, 2011.
[3] James R. Evans, Carl H. Lindner, "Business Analytics: The Next Frontier for Decision Sciences", Decision Science Institute, March 2012.
[4] Surajit Chaudhuri, Umeshwar Dayal, and Vivek Narasayya, "An Overview of Business Intelligence Technology", Communications of the ACM, Vol. 54, No. 8, August 2011.
[5] Galit Shmueli, Otto R. Koppius, "Predictive Analytics in Information Systems Research", MIS Quarterly, Vol. 35, No. 3, pp. 553-572, September 2011.
[6] R. S. Michalski, J. G. Carbonell, T. M. Mitchell, "Machine Learning: An Artificial Intelligence Approach", Springer-Verlag, 2013.
[7] S. B. Kotsiantis, "Supervised Machine Learning: A Review of Classification Techniques", Informatica, Vol. 31, pp. 249-268, 2007.
[8] G. A. F. Seber, A. J. Lee, "Linear Regression Analysis", Wiley Series in Probability and Statistics, 2012.
[9] D. C. Montgomery, E. A. Peck, G. G. Vining, "Introduction to Linear Regression Analysis", Wiley Series in Probability and Statistics, 2015.
[10] Zhang Xuegong, "Introduction to Statistical Learning Theory and Support Vector Machines", Acta Automatica Sinica, 2000.
[11] G. F. Smits, E. M. Jordaan, "Improved SVM Regression Using Mixtures of Kernels", Proceedings of the International Joint Conference on Neural Networks (IJCNN '02), Vol. 3, 2002.
[12] Alex J. Smola and Bernhard Scholkopf, "A Tutorial on Support Vector Regression", Statistics and Computing, Springer, 2004.
[13] S. R. Gunn, "Support Vector Machines for Classification and Regression", ISIS Technical Report, 1998.
[14] C. Li, L. Jiang, "Using Locally Weighted Learning to Improve SMOreg for Regression", Trends in Artificial Intelligence, PRICAI 2006.
[15] http://crsouza.blogspot.in/2010/03/kernel-functions-for-machine-learning.html
[16] http://crsouza.blogspot.in/2010/04/kernel-support-vector-machines-for.html
[17] J. F. Hair, "Multivariate Data Analysis", Upper Saddle River, NJ: Pearson Prentice Hall, 2006.
[18] Mark Hall, "The Weka Data Mining Software: An Update", ACM SIGKDD Explorations, Vol. 11, No. 1, pp. 10-18, June 2009.
[19] Remco R. Bouckaert, "Weka Manual for Version 3-7-8", The University of Waikato, January 2013.
[20] UCI Machine Learning Repository, http://archieve.ics.uci.edu/ml/datasets/Open+University+/Learning+Analytical+dataset
