Lab Experiment 4 - AI

Chapter 10

Regression

Abstract Regression estimates the relationship between dependent variables and
independent variables. Linear regression is an easily understood, popular basic
technique that uses historical data to produce an output variable. Decision tree
regression arrives at an estimate by applying conditional rules to the data, narrowing
the range of possible values until a single prediction is made. Random forests are
ensembles of individual decision trees that produce a prediction by majority voting.
Neural networks are loosely modeled on the brain and learn from the data by
adjusting weights to minimize the prediction error. Proper data processing
techniques, such as ranking feature importance and removing outliers, can further
improve a model's predictions.
Learning outcomes:
• Learn and apply basic models for regression tasks using sklearn and keras.
• Learn data processing techniques to achieve better results.
• Learn how to use simple feature selection techniques to improve our model.
• Learn how data cleaning can help improve our model's RMSE.

Regression looks for relationships among variables. For example, you can
observe several employees of some company and try to understand how their salaries
depend on features such as experience, level of education, role, the city they work
in, and so on.
This is a regression problem where data related to each employee represent one
observation. The presumption is that the experience, education, role, and city are the
independent features, and the salary of the employee depends on them.
Similarly, you can try to establish a mathematical dependence of the prices of
houses on their areas, numbers of bedrooms, distances to the city center, and so on.
Generally, in regression analysis you consider some phenomenon of
interest and have a number of observations. Each observation has two or more
features. Following the assumption that (at least) one of the features depends on
the others, you try to establish a relation among them.
The dependent features are called the dependent variables, outputs, or responses.


The independent features are called the independent variables, inputs, or predictors.
Regression problems usually have one continuous and unbounded dependent
variable. The inputs, however, can be continuous, discrete, or even categorical data
such as gender, nationality, brand, and so on.
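Categorical inputs like these must be converted to numbers before a regression model can use them. As a brief aside, here is a minimal sketch of one common approach, one-hot encoding with pandas; the toy dataframe and its column names are made up for illustration and are not part of this chapter's dataset:

import pandas as pd

# Hypothetical toy data: one numeric input and one categorical input
toy = pd.DataFrame({
    "experience": [1, 3, 5],
    "city": ["Singapore", "Tokyo", "Singapore"],
})

# One-hot encode the categorical column into 0/1 indicator columns
encoded = pd.get_dummies(toy, columns=["city"])
print(encoded)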
It is common practice to denote the outputs with y and the inputs with x. If there
are two or more independent variables, they can be represented as the vector x =
(x1, ..., xr), where r is the number of inputs.
When Do You Need Regression?
Typically, you need regression to answer whether and how some phenomenon
influences another, or how several variables are related. For example, you can use
it to determine if and to what extent experience or gender impacts salaries.
Regression is also useful when you want to forecast a response using a new set
of predictors. For example, you could try to predict electricity consumption of a
household for the next hour given the outdoor temperature, time of day, and number
of residents in that household.
Regression is used in many different fields: economics, computer science, social
sciences, and so on. Its importance rises every day with the availability of large
amounts of data and increased awareness of the practical value of data.
It is important to note that regression does not imply causation. It is easy to find
examples of non-related data that, after a regression calculation, do pass all sorts of
statistical tests. The following is a popular example that illustrates the concept of
data-driven "causality."

It is often said that correlation does not imply causation, although, inadvertently,
we sometimes make the mistake of supposing that there is a causal link between two
variables that follow a certain common pattern.

Dataset: “Alumni Giving Regression (Edited).csv”


You can obtain the dataset from this link:

https://www.dropbox.com/s/veak3ugc4wj9luz/Alumni%20Giving%20Regression%20%28Edited%29.csv?dl=0

Also, you may run the following code in order to download the dataset in
Google Colab:

!wget --quiet "https://www.dropbox.com/s/veak3ugc4wj9luz/Alumni%20Giving%20Regression%20%28Edited%29.csv?dl=0" -O "./Alumni Giving Regression (Edited).csv"

from keras.models import Sequential
from keras.layers import Dense, Dropout
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
from sklearn import linear_model
from sklearn import preprocessing
from sklearn import tree
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
import pandas as pd
import csv

Using TensorFlow backend.

In general, we will import structured datasets using pandas. We will
also demonstrate the code for loading a dataset using NumPy to show the differences
between the two libraries. Here, we are using a method in pandas called read_csv,
which takes the path of a CSV file. The "CSV" stands for comma-separated values. Thus,
if you open up the file in Excel, you would see values separated by commas.

np.random.seed(7)
df = pd.read_csv("Alumni Giving Regression (Edited).csv", delimiter=",")
df.head()

A B C D E F
0 24 0.42 0.16 0.59 0.81 0.08
1 19 0.49 0.04 0.37 0.69 0.11
2 18 0.24 0.17 0.66 0.87 0.31
3 8 0.74 0.00 0.81 0.88 0.11
4 8 0.95 0.00 0.86 0.92 0.28

In pandas, it is very convenient to handle numerical data. Before building any
model, it is good to take a look at some of the dataset's statistics to get a "feel"
for the data. Here, we can simply call df.describe, which is a method of the pandas
dataframe:

df.describe()

                A           B           C           D           E           F
count  123.000000  123.000000  123.000000  123.000000  123.000000  123.000000
mean    17.772358    0.403659    0.136260    0.645203    0.841138    0.141789
std      4.517385    0.133897    0.060101    0.169794    0.083942    0.080674
min      6.000000    0.140000    0.000000    0.260000    0.580000    0.020000
25%     16.000000    0.320000    0.095000    0.505000    0.780000    0.080000
50%     18.000000    0.380000    0.130000    0.640000    0.840000    0.130000
75%     20.000000    0.460000    0.180000    0.785000    0.910000    0.170000
max     31.000000    0.950000    0.310000    0.960000    0.980000    0.410000

Furthermore, pandas provides a helpful method to calculate the pairwise
correlation between variables. What is correlation?
The term "correlation" refers to a mutual relationship or association between
numerical quantities. In almost any business, it is very helpful to express
one quantity in terms of its relationship with others. We are concerned with
this because business plans and departments are not isolated! For example, sales
might increase when the marketing department spends more on advertisements,
or a customer's average purchase amount on an online site may depend on his
or her characteristics. Often, correlation is the first step to understanding these
relationships and subsequently building better business and statistical models.
For example, "D" and "E" have a strong correlation of 0.93, which means that
when D moves in the positive direction, E is likely to move in that direction too.
Here, notice that the correlation of A with A is 1; of course, a variable is always
perfectly correlated with itself.

corr = df.corr(method='pearson')
corr

A B C D E F
A 1.000000 -0.691900 0.414978 -0.604574 -0.521985 -0.549244
B -0.691900 1.000000 -0.581516 0.487248 0.376735 0.540427
C 0.414978 -0.581516 1.000000 0.017023 0.055766 -0.175102
D -0.604574 0.487248 0.017023 1.000000 0.934396 0.681660
E -0.521985 0.376735 0.055766 0.934396 1.000000 0.647625
F -0.549244 0.540427 -0.175102 0.681660 0.647625 1.000000

In general, we would need to test our model. train_test_split is a function
in sklearn's model selection module for splitting data arrays into two subsets: one
for training data and one for testing data. With this function, you do not need to
divide the dataset manually. You can import the function with the following code:
from sklearn.model_selection import train_test_split.
By default, sklearn's train_test_split will make random partitions for the two subsets.
However, you can also specify a random state for the operation.
Here, take note that we will need to pass in the X and Y to the function. X refers
to the features while Y refers to the target of the dataset.
Y_POSITION = 5
model_1_features = [i for i in range(0, Y_POSITION)]
X = df.iloc[:, model_1_features]
Y = df.iloc[:, Y_POSITION]

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.20, random_state=2020)

10.1 Linear Regression

Linear regression is a basic predictive analytics technique that uses historical data to
predict an output variable. It is popular for predictive modeling because it is easily
understood and can be explained using plain English.
The basic idea is that if we can fit a linear regression model to observed data, we
can then use the model to predict any future values. For example, let us assume that
we have found from historical data that the price (P) of a house is linearly dependent
upon its size (S); in fact, we found that a house's price is exactly 90 times its size.
The equation will look like this: P = 90*S
With this model, we can then predict the price of any house. If we have a
house that is 1,500 square feet, we can calculate its price to be: P = 90*1500
= $135,000
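To make this concrete, here is a minimal sketch of that toy model; the 90-times rule is the illustrative assumption from the text above, not a fitted model:

def predict_price(size_sqft):
    # Toy model from the text: price is exactly 90 times the size
    return 90 * size_sqft

print(predict_price(1500))  # 135000, matching the worked example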

There are two kinds of variables in a linear regression model:


• The input or predictor variable is the variable(s) that help predict the value of the
output variable. It is commonly referred to as X.
• The output variable is the variable that we want to predict. It is commonly
referred to as Y.
To estimate Y using linear regression, we assume the equation: Ye = α + β X
where Ye is the estimated or predicted value of Y based on our linear equation. Our
goal is to find statistically significant values of the parameters α and β that minimize
the difference between Y and Ye . If we are able to determine the optimum values of
these two parameters, then we will have the line of best fit that we can use to predict
the values of Y, given the value of X. So, how do we estimate α and β? We can use
a method called ordinary least squares.

The objective of the least squares method is to find values of α and β that
minimize the sum of the squared difference between Y and Ye . We will not delve
into the mathematics of least squares in our book.
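We will not derive it here, but for the single-input case the least squares solution has a simple closed form: β = cov(X, Y) / var(X) and α = mean(Y) - β * mean(X). Below is a minimal NumPy sketch of this closed form; the choice of column E as the lone input is purely for illustration:

import numpy as np

x = X_train["E"].values  # single predictor, chosen for illustration
y = y_train.values

# Closed-form ordinary least squares for one input
# (ddof=1 so the variance matches np.cov's default normalization)
beta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
alpha = y.mean() - beta * x.mean()
print(alpha, beta)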
In the results below, we notice that when E increases by 1, our Y increases by
0.175399. Also, when C increases by 1, our Y falls by 0.044160.

model1 = linear_model.LinearRegression()
model1.fit(X_train, y_train)
y_pred_train1 = model1.predict(X_train)
print("Regression")
print("================================")
# Note: mean_squared_error returns the MSE; take its square root for the true RMSE
RMSE_train1 = mean_squared_error(y_train, y_pred_train1)
print("Regression Train set: RMSE {}".format(RMSE_train1))
print("================================")
y_pred1 = model1.predict(X_test)
RMSE_test1 = mean_squared_error(y_test, y_pred1)
print("Regression Test set: RMSE {}".format(RMSE_test1))
print("================================")

coef_dict = {}
for coef, feat in zip(model1.coef_, model_1_features):
    coef_dict[df.columns[feat]] = coef

print(coef_dict)

Regression
================================
Regression Train set: RMSE 0.0027616933222892287
================================
Regression Test set: RMSE 0.0042098240263563754
================================
{'A': -0.0009337757382417014, 'B': 0.16012156890162915, 'C': -0.04416001542534971, 'D': 0.15217907817100398, 'E': 0.17539950794101034}

10.2 Decision Tree Regression

A decision tree arrives at an estimate by asking a series of questions of the data,
each question narrowing the possible values until the model gets confident enough
to make a single prediction. The order of the questions and their content are
determined by the model. In addition, the questions asked are all in a True/False
form.
This is a little tough to grasp because it is not how humans naturally think, and
perhaps the best way to show this difference is to build a real decision tree. Suppose
x1 and x2 are two features that allow us to make predictions for the
target variable y by asking True/False questions, as in the sketch below.
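As a minimal sketch (not the book's exact code), a decision tree regressor from sklearn can be fit on the same train/test split and scored like the linear model above; the max_depth value is an illustrative assumption, not a tuned choice:

from sklearn import tree
from sklearn.metrics import mean_squared_error

# A shallow tree keeps the sequence of True/False questions easy to inspect
model2 = tree.DecisionTreeRegressor(max_depth=3, random_state=2020)
model2.fit(X_train, y_train)

print("Decision Tree Train set: MSE {}".format(
    mean_squared_error(y_train, model2.predict(X_train))))
print("Decision Tree Test set: MSE {}".format(
    mean_squared_error(y_test, model2.predict(X_test))))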
The decision of making strategic splits heavily affects a tree's accuracy. The
decision criteria are different for classification and regression trees. Decision tree
regressors normally use mean squared error (MSE) to decide whether to split a node
into two or more sub-nodes. Suppose we are building a binary tree; the algorithm
will first pick a value and split the data into two subsets. For each subset, it will
calculate the MSE separately. The tree chooses the value that results in the smallest MSE.
Let us examine how splitting is decided for a decision tree regressor in more
detail. The first step in creating a tree is to create the first binary decision. How are
you going to do it? One way is sketched below.
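Here is a minimal sketch of that first binary decision, assuming a single feature; the choice of column E is purely for illustration. We scan candidate thresholds, compute the size-weighted MSE of the two resulting groups for each, and keep the threshold with the smallest value:

import numpy as np

def split_mse(x, y, threshold):
    # Size-weighted MSE of the two groups created by splitting x at threshold
    total = 0.0
    for group in (y[x <= threshold], y[x > threshold]):
        if len(group) > 0:
            # Each group is predicted by its own mean, so its MSE is the
            # average squared deviation from that mean
            total += len(group) * np.mean((group - group.mean()) ** 2)
    return total / len(y)

x = X_train["E"].values
y = y_train.values

# Candidate thresholds: midpoints between consecutive distinct values of E
values = np.unique(x)
candidates = (values[:-1] + values[1:]) / 2
best = min(candidates, key=lambda t: split_mse(x, y, t))
print("Best first split on E:", best)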
