Lab Experiment 4 - AI
Regression
Regression looks for relationships among variables. For example, you can
observe several employees of some company and try to understand how their salaries
depend on features such as experience, level of education, role, the city they work
in, and so on.
This is a regression problem where data related to each employee represent one
observation. The presumption is that the experience, education, role, and city are the
independent features, and the salary of the employee depends on them.
Similarly, you can try to establish a mathematical dependence of the prices of
houses on their areas, numbers of bedrooms, distances to the city center, and so on.
Generally, in regression analysis you consider some phenomenon of
interest and have a number of observations. Each observation has two or more
features. Following the assumption that (at least) one of the features depends on
the others, you try to establish a relation among them.
The dependent features are called the dependent variables, outputs, or responses.
The independent features are called the independent variables, inputs, or predictors.
Regression problems usually have one continuous and unbounded dependent
variable. The inputs, however, can be continuous, discrete, or even categorical data
such as gender, nationality, brand, and so on.
It is a common practice to denote the outputs with y and the inputs with x. If there
are two or more independent variables, they can be represented as the vector x =
(x1, ..., xr), where r is the number of inputs.
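As a small illustration (a sketch with made-up numbers), a single observation's inputs can be stored as a NumPy vector, and a whole dataset as a matrix with one row per observation:

import numpy as np

# One observation with r = 3 inputs (illustrative values):
# years of experience, education level, and a coded role
x = np.array([5.0, 2.0, 1.0])

# Four observations: one row per observation, one column per input
X = np.array([[5.0, 2.0, 1.0],
              [3.0, 1.0, 2.0],
              [8.0, 3.0, 1.0],
              [2.0, 1.0, 3.0]])

# The corresponding outputs (e.g., salaries), one per observation
y = np.array([50.0, 38.0, 71.0, 33.0])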
When Do You Need Regression?
Typically, you need regression to answer whether and how some phenomenon
influences another, or how several variables are related. For example, you can use
it to determine if and to what extent experience or gender impacts salaries.
Regression is also useful when you want to forecast a response using a new set
of predictors. For example, you could try to predict electricity consumption of a
household for the next hour given the outdoor temperature, time of day, and number
of residents in that household.
Regression is used in many different fields: economics, computer science, social
sciences, and so on. Its importance grows every day with the increasing availability
of large amounts of data and rising awareness of the practical value of data.
It is important to note that regression does not imply causation. It is easy to find
examples of non-related data that, after a regression calculation, do pass all sorts of
statistical tests. The following is a popular example that illustrates the concept of
data-driven “causality.”
It is often said that correlation does not imply causation, although, inadvertently,
we sometimes make the mistake of supposing that there is a causal link between two
variables that follow a certain common pattern.
The dataset can be downloaded from:
https://www.dropbox.com/s/veak3ugc4wj9luz/Alumni%20Giving%20Regression%20%28Edited%29.csv?dl=0
Alternatively, you may run the following code to download the dataset in
Google Colab:

!wget --quiet "https://www.dropbox.com/s/veak3ugc4wj9luz/Alumni%20Giving%20Regression%20%28Edited%29.csv?dl=0" -O "Alumni Giving Regression (Edited).csv"
In general, we will import structured datasets using pandas. We will
also demonstrate the code for loading a dataset using NumPy, to show the differences
between the two libraries. Here, we are using a pandas method called read_csv,
which takes the path of a CSV file. CSV stands for comma-separated values: if
you open the file in Excel, you will see values separated by commas.
import numpy as np
import pandas as pd

np.random.seed(7)

# Load the CSV file into a DataFrame and preview the first five rows
df = pd.read_csv("Alumni Giving Regression (Edited).csv", delimiter=",")
df.head()
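For comparison, here is a minimal sketch of loading the same file with NumPy's genfromtxt. Unlike pandas, it returns a plain numeric array without column labels, which is the main practical difference between the two libraries:

import numpy as np

# skip_header=1 drops the column-name row, since NumPy arrays are unlabeled
data = np.genfromtxt("Alumni Giving Regression (Edited).csv",
                     delimiter=",", skip_header=1)
print(data[:5])  # first five rows, analogous to df.head()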
A B C D E F
0 24 0.42 0.16 0.59 0.81 0.08
1 19 0.49 0.04 0.37 0.69 0.11
2 18 0.24 0.17 0.66 0.87 0.31
3 8 0.74 0.00 0.81 0.88 0.11
4 8 0.95 0.00 0.86 0.92 0.28
df.describe()

                A           B           C           D           E           F
count  123.000000  123.000000  123.000000  123.000000  123.000000  123.000000
mean    17.772358    0.403659    0.136260    0.645203    0.841138    0.141789
std      4.517385    0.133897    0.060101    0.169794    0.083942    0.080674
min      6.000000    0.140000    0.000000    0.260000    0.580000    0.020000
25%     16.000000    0.320000    0.095000    0.505000    0.780000    0.080000
50%     18.000000    0.380000    0.130000    0.640000    0.840000    0.130000
75%     20.000000    0.460000    0.180000    0.785000    0.910000    0.170000
max     31.000000    0.950000    0.310000    0.960000    0.980000    0.410000
corr = df.corr(method='pearson')  # pairwise Pearson correlations between columns
corr
A B C D E F
A 1.000000 -0.691900 0.414978 -0.604574 -0.521985 -0.549244
B -0.691900 1.000000 -0.581516 0.487248 0.376735 0.540427
C 0.414978 -0.581516 1.000000 0.017023 0.055766 -0.175102
D -0.604574 0.487248 0.017023 1.000000 0.934396 0.681660
E -0.521985 0.376735 0.055766 0.934396 1.000000 0.647625
F -0.549244 0.540427 -0.175102 0.681660 0.647625 1.000000
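As a quick way to read this matrix, you can pull out how strongly each column correlates with F (assuming, as in the regression later in this chapter, that F is the response); a minimal sketch:

# Absolute correlation of each column with F, strongest first
print(corr["F"].drop("F").abs().sort_values(ascending=False))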
Linear regression is a basic predictive analytics technique that uses historical data to
predict an output variable. It is popular for predictive modeling because it is easily
understood and can be explained using plain English.
The basic idea is that if we can fit a linear regression model to observed data, we
can then use the model to predict any future values. For example, let us assume that
we have found from historical data that the price (P) of a house is linearly dependent
upon its size (S)—in fact, we found that a house’s price is exactly 90 times its size.
The equation will look like this: P = 90*S
With this model, we can then predict the price of any house. If we have a
house that is 1,500 square feet, we can calculate its price as: P = 90 * 1,500
= $135,000
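In code, this toy model is just a one-line function; the sketch below reproduces the worked example:

# Toy model from the text: price is exactly 90 times the size
def predict_price(size_sqft):
    return 90 * size_sqft

print(predict_price(1500))  # 135000, i.e. $135,000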
The objective of the least squares method is to find the values of α and β that
minimize the sum of the squared differences between the observed responses Y and
the predicted values Ŷ = α + βX. We will not delve into the mathematics of least
squares in our book.
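For reference, that objective can be written compactly as follows, where Ŷᵢ = α + βXᵢ is the predicted value for observation i:

\min_{\alpha, \beta} \sum_{i=1}^{n} \big( Y_i - (\alpha + \beta X_i) \big)^2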
Looking ahead to the fitted coefficients printed below, we notice that when E
increases by 1, the predicted Y increases by 0.175399. Likewise, when C increases
by 1, the predicted Y falls by 0.044160.
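The code below relies on a train/test split that is not shown in this excerpt. A minimal sketch of that setup, assuming columns A-E are the predictors and F is the response (consistent with the coefficient dictionary printed below), and with an illustrative 80/20 split:

from sklearn.model_selection import train_test_split

model_1_features = [0, 1, 2, 3, 4]   # columns A-E as predictors (assumption)
X = df.iloc[:, model_1_features]
y = df["F"]                          # column F as the response (assumption)

# The split ratio and random_state are illustrative assumptions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7)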
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

model1 = linear_model.LinearRegression()
model1.fit(X_train, y_train)
y_pred_train1 = model1.predict(X_train)
y_pred_test1 = model1.predict(X_test)
print("Regression")
print("================================")
# Note: mean_squared_error returns the MSE, which the text labels RMSE
RMSE_train1 = mean_squared_error(y_train, y_pred_train1)
RMSE_test1 = mean_squared_error(y_test, y_pred_test1)
print("Regression Train set: RMSE {}".format(RMSE_train1))
print("================================")
print("Regression Test set: RMSE {}".format(RMSE_test1))
print("================================")
coef_dict = {}  # map each coefficient back to its column name
for coef, feat in zip(model1.coef_, model_1_features):
    coef_dict[df.columns[feat]] = coef
print(coef_dict)
Regression
================================
Regression Train set: RMSE 0.0027616933222892287
================================
Regression Test set: RMSE 0.0042098240263563754
================================
{'A': -0.0009337757382417014, 'B': 0.16012156890162915,
 'C': -0.04416001542534971, 'D': 0.15217907817100398,
 'E': 0.17539950794101034}
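Once fitted, the model can also score a new observation. A minimal sketch, with purely made-up input values for columns A-E:

# A hypothetical new observation for predictors A-E (made-up values)
new_obs = pd.DataFrame([[20, 0.40, 0.10, 0.65, 0.85]],
                       columns=["A", "B", "C", "D", "E"])
print(model1.predict(new_obs))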