
HIV Regression Source Code

1. The document describes a Python module for importing, cleaning, and preparing data for machine learning. It defines classes that import data from a CSV file and clean it by dropping duplicate rows and columns, removing low-variance features, dropping rows with missing values, and controlling noise.
2. The data is then split into training and test sets using train_test_split. The target variable can be binarized using a threshold.
3. The target distribution in the training and test sets is visualized to check for imbalance. The cleaned and prepared data frames are accessible as attributes for further analysis or model training.

Uploaded by

Văn Thịnh
Copyright
© © All Rights Reserved

Module 1: Import data

In [1]:
import numpy as np
import pandas as pd
import os

class import_data():
    'use class.data to get data'

    def __init__(self):
        # Keep asking until the path points to an existing file.
        while True:
            self.path = input("Please input the path file (EX:...HIV Classificat
            checkPath = os.path.isfile(self.path)
            if checkPath == True:
                break
        self.data = None
        self.data_csv()

    def data_csv(self):
        self.data = pd.read_csv(self.path)
        display(self.data.head(2))
        # Drop one column per prompt; an empty answer ends the loop.
        while True:
            try:
                col_drop = input("Please input the columns you want to drop? (En
                if len(col_drop.strip()) == 0:
                    break
                self.data = self.data.drop([col_drop], axis = 1)
            except KeyError:
                # DataFrame.drop raises KeyError for an unknown column name.
                print("Error columns")
        display(self.data.head(2))

In [4]:
# /Users/macbook/Documents/CH2020/Database Regression/HIV regression/Database fu
df = import_data()
data = df.data

   Name  Smiles                                              pChEMBL Value  nAcid  ALogP    ALogp2
0  1     O=C(N/N=C/c1c(O)cc(O)cc1)c1n[nH]c(C2CC2)c1          4.37           0      -1.8049  3.25766
1  2     O=C(NCc1occc1)c1cc(C(=O)NCc2occc2)cc(C(=O)NCc2...  5.10           0      -2.3319  5.43775

2 rows × 2529 columns

   pChEMBL Value  nAcid  ALogP    ALogp2    AMR      naAromAtom  nAromBond  nAtom  nHeavyAtom
0  4.37           0      -1.8049  3.257664  35.1281  11          11         35     ...
1  5.10           0      -2.3319  5.437758  41.3184  21          21         54     ...

2 rows × 2527 columns
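
The interactive prompts in `import_data` make the import step hard to rerun unattended. A minimal non-interactive sketch of the same idea, assuming a hypothetical helper `load_descriptors` and a made-up two-row CSV (neither appears in the original notebook):

```python
import io

import pandas as pd


def load_descriptors(source, drop_columns=()):
    """Read a CSV and drop the named columns (hypothetical helper)."""
    data = pd.read_csv(source)
    missing = [c for c in drop_columns if c not in data.columns]
    if missing:
        raise KeyError(f"Columns not found: {missing}")
    return data.drop(columns=list(drop_columns))


# Made-up stand-in for the descriptor file; only the column names follow the notebook.
csv_text = "Name,Smiles,pChEMBL Value,nAcid\n1,CCO,4.37,0\n2,CCN,5.10,0\n"
data = load_descriptors(io.StringIO(csv_text), drop_columns=["Name", "Smiles"])
print(data.shape)
```

Passing the columns as an argument keeps the behaviour of the prompt loop (drop identifier columns, keep the target first) while letting the cell run without user input.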


Module 2: Data_cleaning
In [5]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split

class Data_cleaned:

    def __init__(self, data):
        self.data_0= data
        self.data = self.data_0.copy()

    def Duplicate_data(self):
        # Duplicate rows:
        dup_rows = self.data.duplicated()
        print(f"Total duplicated rows: {(dup_rows == True).sum()}")
        print("Data befor drop duplicates:", self.data.shape[0])
        self.data.drop_duplicates(inplace = True)
        print("Data after drop duplicates:", self.data.shape[0])

        # Duplicate columns: transpose so columns become rows, then reuse the row logic.
        self.data = self.data.T
        dup_cols = self.data.duplicated()
        print(f"Total similar columns: {(dup_cols == True).sum()}")
        print("Data befor drop duplicates:", self.data.shape[0])
        self.data.drop_duplicates(inplace = True)
        self.data = self.data.T
        print("Data after drop duplicates:", self.data.shape[1])

    def Variance_Threshold(self):
        # The first column is the target; the remaining columns are features.
        X = self.data.values[:, 1:]
        y = self.data.values[:, 0]
        print(X.shape, y.shape)

        # Define thresholds to check
        thresholds = np.arange(0.0, 1, 0.05)

        # Apply transform with each threshold
        results = list()
        for t in thresholds:
            # define the transform
            transform = VarianceThreshold(threshold=t)
            # transform the input data
            X_sel = transform.fit_transform(X)
            # determine the number of input features
            n_features = X_sel.shape[1]
            print('>Threshold=%.2f, Features=%d' % (t, n_features)) # store the
            results.append(n_features)
        # plot the threshold vs the number of selected features
        plt.plot(thresholds, results)
        plt.show()

    # Clear variance Threshold:
    def Low_variance_cleaning(self):
        while True:
            try:
                inputthreshold=float(input("Please, input threshold you want:"
                break
            except ValueError:
                print("Error threshold!")

        def VarianceThreshold_selector(data, thresh):
            # Keep only the columns whose variance exceeds thresh.
            df1 = data.copy(deep = True)
            selector = VarianceThreshold(thresh)
            selector.fit(df1)
            features = selector.get_support(indices = False)
            df2 = data.loc[:,features]
            return df2

        data5 = VarianceThreshold_selector(self.data, thresh = inputthreshold
        self.data = data5

    def Missing_value_cleaning(self):
        print("Total missing value", (self.data.isnull().sum()).sum())
        null_data = self.data[self.data.isnull().any(axis=1)]
        display(null_data)
        print("Total row with missing value", null_data.shape[0])
        print("Shape of Data after cleaning missing value:", self.data.shape
        self.data.dropna(inplace = True)
        print("Shape of Data before cleaning missing value:", self.data.shape

    def Activate_Data_Cleaned(self):
        self.Duplicate_data()
        self.Variance_Threshold()
        self.Low_variance_cleaning()
        self.Missing_value_cleaning()

class noise_control(Data_cleaned):

    def __init__(self, data):
        self.data_0 = data
        self.data = self.data_0.copy()

    def feature_noise(self):
        self.cols_remove = []
        cols = []
        while True:
            feature_doub_1 = input("Please input 1st feature duplicated")
            feature_doub_2 = input("Please input 2nd feature duplicated")
            # Accept the pair only when both names are real columns.
            if feature_doub_1 in self.data.columns and feature_doub_2 in self.data.columns:
                cols.append(feature_doub_1)
                cols.append(feature_doub_2)
                self.cols_remove.append(feature_doub_1)
            # An empty first answer ends the loop.
            if len(feature_doub_1.strip()) == 0:
                break
        self.data_noise = self.data[cols]

    def check_noise(self):
        self.data_dif = pd.DataFrame()
        for i in range(0, self.data_noise.shape[1]-1):
            self.data_dif[f"{i}"] = self.data_noise.iloc[:, i+1] - self.
        self.data_dif = self.data_dif.iloc[:, [i for i in range(0,self.data_dif

    def check_index_noise(self):
        self.idx = []
        for i in range(0, self.data_dif.shape[1]):
            # Series.items() yields index labels, which is what DataFrame.drop expects.
            for key, values in self.data_dif.iloc[:,i].items():
                if values != 0.0:
                    self.idx.append(key)
        self.idx = list(set(self.idx))
        self.data.drop(self.idx, axis = 0, inplace = True)
        self.data.drop(self.cols_remove, axis = 1, inplace = True)

    def Activate_noise_control(self):
        self.feature_noise()
        self.check_noise()
        self.check_index_noise()

class train_test_prepare(noise_control):

    def __init__(self, data):
        self.data_0 = data
        self.data = self.data_0.copy()
        self.Data_train = None
        self.Data_test = None

    def target_bin(self, thresh):
        # Binarize the chosen column: 0 below the threshold, 1 otherwise.
        self.df1 = self.data.copy()
        self.thresh = thresh
        X_name=str(input("Please, input the X column's name:"))
        t1 = self.df1[ X_name] < self.thresh
        self.df1.loc[t1, X_name] = 0
        t2 = self.df1[ X_name] >= self.thresh
        self.df1.loc[t2, X_name] = 1
        return self.df1

    def Data_split(self):
        self.df = self.data.copy()
        while True:
            self.RoC = input("Do you want to make classification?(Y/N)").title
            if self.RoC == 'Y' or self.RoC == 'N':
                break
        if self.RoC.title() == "Y":
            while True:
                try:
                    self.thresh = float(input("Please input the threshold"))
                    break
                except ValueError:
                    print("Error value!")
            self.df1 = self.target_bin(thresh = self.thresh)
            y = self.df1.iloc[:, 0].values
            self.stratify = y
        else:
            self.stratify = None
            y = self.df.iloc[:, 0].values

        X = self.df.iloc[:, 1:].values
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size

        # index: the original column names, restored after the split
        self.idx = self.df.T.index

        # Train:
        self.df_X_train = pd.DataFrame(X_train)
        self.df_y_train = pd.DataFrame(y_train)
        self.df_train = pd.concat([self.df_y_train, self.df_X_train], axis
        self.df_a = self.df_train.T
        self.df_a = self.df_a.reset_index(drop = True)
        for i in range(0,self.idx.size):
            self.df_a.rename(index ={i: self.idx[i]},inplace= True)
        self.Data_train = self.df_a.T

        # Test:
        self.df_X_test = pd.DataFrame(X_test)
        self.df_y_test = pd.DataFrame(y_test)
        self.df_test = pd.concat([self.df_y_test, self.df_X_test], axis = 1
        self.df_b = self.df_test.T
        self.df_b = self.df_b.reset_index(drop = True)
        for i in range(0,self.idx.size):
            self.df_b.rename(index ={i: self.idx[i]},inplace= True)
        self.Data_test = self.df_b.T

    def Visualize_target(self):
        if self.RoC.title() == "Y":
            plt.figure(figsize = (16,5))
            plt.subplot(1,2,1)
            plt.hist(self.Data_train.iloc[:,0])
            plt.title(f'Imbalanced ratio: {((self.Data_train.iloc[:,0].values
            plt.subplot(1,2,2)
            plt.hist(self.Data_test.iloc[:,0])
            plt.title(f'Imbalanced ratio: {((self.Data_test.iloc[:,0].values
            plt.show()
        else:
            plt.figure(figsize = (16,5))
            plt.subplot(1,2,1)
            plt.hist(self.Data_train.iloc[:,0])
            plt.title(f'Train distribution')
            plt.subplot(1,2,2)
            plt.hist(self.Data_test.iloc[:,0])
            plt.title(f'Test distribution')
            plt.show()

    def Nomial(self):
        DF_1 = self.data_0.select_dtypes("int64")
        DF_2 = DF_1.loc[:, (DF_1.nunique() <10).values & (DF_1.max() <10).values
        idx2 = DF_2.T.index #select columns with int64
        idx3 = self.Data_train.T.index #select all columns in data_train
        idx4 = idx2.intersection(idx3) #idx4 are int64 cols in Data_train
        self.Data_train[idx4]=self.Data_train[idx4].astype('int64') #set all id
        self.Data_test[idx4]=self.Data_test[idx4].astype('int64')
        if self.RoC == 'Y':
            self.Data_train.iloc[:,0] = self.Data_train.iloc[:,0].astype('int64'
            self.Data_test.iloc[:,0] = self.Data_test.iloc[:,0].astype('int64'

    def Activate(self):
        self.Activate_Data_Cleaned()
        self.Activate_noise_control()
        self.Data_split()
        self.Visualize_target()
        self.Nomial()

In [6]:
df=train_test_prepare(data)
df.Activate()

Total duplicated rows: 42


Data befor drop duplicates: 1806
Data after drop duplicates: 1764
Total similar columns: 780
Data befor drop duplicates: 2527
Data after drop duplicates: 1747
(1764, 1746) (1764,)
>Threshold=0.00, Features=1744
>Threshold=0.05, Features=1080
>Threshold=0.10, Features=946
>Threshold=0.15, Features=863
>Threshold=0.20, Features=804
>Threshold=0.25, Features=675
>Threshold=0.30, Features=660
>Threshold=0.35, Features=647
>Threshold=0.40, Features=636
>Threshold=0.45, Features=632
>Threshold=0.50, Features=627
>Threshold=0.55, Features=625
>Threshold=0.60, Features=617
>Threshold=0.65, Features=611
>Threshold=0.70, Features=601
>Threshold=0.75, Features=594
>Threshold=0.80, Features=589
>Threshold=0.85, Features=584
>Threshold=0.90, Features=579
>Threshold=0.95, Features=575

Error threshold!

Total missing value 35


      pChEMBL Value  nAcid  ALogP    ALogp2    AMR      naAromAtom  nAromBond  nAtom  nHeavyAtom
1003  5.98           0.0    -1.7872  3.194084  55.9045  6.0         6.0        38.0   ...

1 rows × 1081 columns

Total row with missing value 1


Shape of Data after cleaning missing value: (1764, 1081)
Shape of Data before cleaning missing value: (1763, 1081)
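
The log above comes from three cleaning steps: dropping duplicate rows and columns, filtering low-variance features, and dropping rows with missing values. A minimal sketch of the same steps on a made-up frame, assuming the target sits in the first column; unlike the class, the sketch drops missing rows before the variance filter and leaves the target out of the filter, purely to keep the toy example simple:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Made-up data: one duplicated row, one duplicated column, one constant column, one NaN.
toy = pd.DataFrame({
    "target": [1.0, 2.0, 2.0, 3.0, 4.0],
    "a":      [0.1, 0.5, 0.5, 0.9, 0.3],
    "a_copy": [0.1, 0.5, 0.5, 0.9, 0.3],
    "const":  [1.0, 1.0, 1.0, 1.0, 1.0],
    "b":      [5.0, 1.0, 1.0, 7.0, np.nan],
})

# 1. Drop duplicated rows, then duplicated columns (via the transpose).
clean = toy.drop_duplicates()
clean = clean.T.drop_duplicates().T

# 2. Drop rows that still contain missing values.
clean = clean.dropna()

# 3. Keep the features whose variance exceeds the threshold.
features = clean.drop(columns="target")
selector = VarianceThreshold(threshold=0.05)
selector.fit(features)
kept = features.columns[selector.get_support()]
clean = pd.concat([clean[["target"]], features[kept]], axis=1)
print(clean.shape)
```

Working with column labels instead of `.values` keeps the descriptor names attached to the data at every step.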

The following descriptor pairs carry the same meaning but cannot be removed by
Duplicated_row! (An update is needed to remove this noise.)

diameter : topoDiameter
radius : topoRadius
weinerPath : WPATH
weinerPol : WPOL
zagreb : Zagreb
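
The pairs listed above name the same quantity twice, so de-duplication only catches them when every value matches exactly. A minimal sketch of an alternative check, assuming a hypothetical helper `drop_equivalent_columns` and made-up values (neither is in the original module): each pair is compared numerically and the second column is dropped only when the two agree in every row.

```python
import numpy as np
import pandas as pd


def drop_equivalent_columns(df, pairs, atol=1e-8):
    """Drop the second column of each pair when both agree row by row (hypothetical helper)."""
    to_drop = []
    for keep, other in pairs:
        if keep in df.columns and other in df.columns:
            if np.allclose(df[keep], df[other], atol=atol, equal_nan=True):
                to_drop.append(other)
    return df.drop(columns=to_drop), to_drop


# Column names come from the list above; the values are made up.
toy = pd.DataFrame({
    "diameter":     [3.0, 5.0, 4.0],
    "topoDiameter": [3.0, 5.0, 4.0],
    "zagreb":       [10.0, 22.0, 18.0],
    "Zagreb":       [10.0, 22.0, 19.0],  # differs in one row, so it is kept
})
pairs = [("diameter", "topoDiameter"), ("zagreb", "Zagreb")]
reduced, dropped = drop_equivalent_columns(toy, pairs)
print(dropped)
```

The `noise_control` class takes a different route: it removes the rows where a pair disagrees and then drops the first column of the pair. The sketch keeps all rows and only removes columns that are provably redundant.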

In [7]:
Data_train = df.Data_train
Data_test = df.Data_test
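
`Data_split` converts the frame to NumPy arrays and then restores the column names by transposing and renaming. `train_test_split` also accepts a DataFrame directly and returns DataFrames, which keeps the names without that step. A minimal sketch on made-up data; the `test_size=0.2`, the `random_state`, and the threshold are assumptions, since the corresponding arguments are cut off in this export:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with the target in the first column; all values are made up.
toy = pd.DataFrame({
    "pChEMBL Value": np.linspace(4.0, 8.0, 20),
    "ALogP": np.linspace(-2.0, 2.0, 20),
    "AMR": np.linspace(30.0, 60.0, 20),
})

# Regression: split the frame itself, so the column names survive.
train_df, test_df = train_test_split(toy, test_size=0.2, random_state=42)

# Classification: binarize the target first, then stratify on the label.
threshold = 6.0  # illustrative cut-off, not the author's value
labelled = toy.copy()
labelled["pChEMBL Value"] = (toy["pChEMBL Value"] >= threshold).astype("int64")
train_c, test_c = train_test_split(
    labelled, test_size=0.2, random_state=42, stratify=labelled["pChEMBL Value"]
)
print(train_df.shape, test_df.shape)
```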

Module 3: Outlier handling


In [9]:
import warnings
warnings.filterwarnings(action='ignore')
from sklearn.preprocessing import PowerTransformer, QuantileTransformer, KBinsDi
import matplotlib.pyplot as plt
import seaborn as sns
class Check_Univariate_outlier():

    def __init__(self, data_train=None, data_test=None):
        self.data_train_0 = data_train
        self.data_test_0 = data_test
        self.data_train = self.data_train_0.copy()
        self.data_test = self.data_test_0.copy()
        self.data_train_clean = None
        self.data_test_clean = None
        self.df = None
        self.good = []
        self.bad = []
        self.scl1 = PowerTransformer()
        self.scl2 = QuantileTransformer(output_distribution = "normal")
        self.scl3 = QuantileTransformer(output_distribution = "uniform")

    def Check_remove_data(self):
        self.df_train = self.data_train.copy()
        self.df_test = self.data_test.copy()
        for col_name in self.df_train.select_dtypes("float64").columns:
            q1 = self.df_train[col_name].quantile(0.25)
            q3 = self.df_train[col_name].quantile(0.75)
            iqr = q3 - q1
            low = q1-1.5*iqr
            high = q3+1.5*iqr
            self.df_train = self.df_train[(self.df_train[col_name] <= high)
            self.df_test = self.df_test[(self.df_test[col_name] <= high) &
        print("Total data remove on Train", self.data_train_0.shape[0] -self
        print("Total data remove on Test", self.data_test_0.shape[0] -self.
        self.data_train_clean = self.df_train
        self.data_test_clean = self.df_test

    def Check_quantity_features(self):
        self.good = []
        self.bad = []
        self.df_train = self.data_train.copy()
        self.df_test = self.data_test.copy()
        for col_name in self.df_train.select_dtypes("float64").columns:
            q1 = self.df_train[col_name].quantile(0.25)
            q3 = self.df_train[col_name].quantile(0.75)
            iqr = q3-q1
            remove = self.data_train.shape[0] - (self.df_train[(self.df_train
            if remove == 0:
                self.good.append(col_name)
            else:
                self.bad.append(col_name)
        print(f"Number of good features: {len(self.good)}")
        print(f"Number of bad features with data remove > 0: {len(self.bad)
        print("*"*75)

    def Check_remove_outlier(self):
        self.Check_remove_data()
        self.Check_quantity_features()

    def Outlier_Winsor(self):
        print("Handling with Winsorization method")
        self.df_train = self.data_train_0.copy()
        self.df_test = self.data_test_0.copy()
        for col_name in self.df_train.select_dtypes(include="float64").columns
            q1 = self.df_train[col_name].quantile(0.25)
            q3 = self.df_train[col_name].quantile(0.75)
            iqr = q3-q1
            self.df_train.loc[(self.df_train[col_name] <= (q1-1.5*iqr)), col_nam
            self.df_train.loc[(self.df_train[col_name] >= (q3+1.5*iqr)), col_nam
            #for test
            self.df_test.loc[(self.df_test[col_name] <= (q1-1.5*iqr)), col_name
            self.df_test.loc[(self.df_test[col_name] >= (q3+1.5*iqr)), col_name
        self.data_train = self.df_train
        self.data_test = self.df_test
        self.Check_remove_outlier()

    def Transformation(self):
        self.df_train = self.data_train_0.copy()
        self.df_test = self.data_test_0.copy()
        # Choose the transformer
        while True:
            try:
                self.transformer = int(input("Please select type of transformati
                break
            except ValueError:
                print("Error values! Input number!")
        if self.transformer == 1:
            self.scl =self.scl1
            print("Handling with Transformation_Powertransformer method")
        elif self.transformer == 2:
            self.scl =self.scl2
            print("Handling with Transformation_Gaussiantransformer method"
        else:
            self.scl =self.scl3
            print("Handling with Transformation_Uniformtransformer method")

        # Train: integer columns are set aside, float features are transformed
        df_train_int = self.df_train.select_dtypes("int64")
        df_train_int = df_train_int.reset_index(drop = True)

        y_train = self.df_train.select_dtypes("float64").iloc[:,0].values
        X_train = self.df_train.select_dtypes("float64").iloc[:,1:].values

        self.scl.fit(X_train)
        X_train_trans = self.scl.transform(X_train)
        idx = self.df_train.select_dtypes("float64").T.index

        df_X_train = pd.DataFrame(X_train_trans)
        df_y_train = pd.DataFrame(y_train)
        df_train = pd.concat([df_y_train, df_X_train], axis = 1)
        df_a = df_train.T
        df_a = df_a.reset_index(drop = True)
        for i in range(0,idx.size):
            df_a.rename(index ={i: idx[i]},inplace= True)
        Data_train_float = df_a.T
        self.data_train = pd.concat([Data_train_float , df_train_int], axis

        # Test: reuse the transformer fitted on the training set
        df_test_int = self.df_test.select_dtypes("int64")
        df_test_int = df_test_int.reset_index(drop = True)

        y_test = self.df_test.select_dtypes("float64").iloc[:,0].values
        X_test = self.df_test.select_dtypes("float64").iloc[:,1:].values

        X_test_trans = self.scl.transform(X_test)
        idx = self.df_test.select_dtypes("float64").T.index

        df_X_test = pd.DataFrame(X_test_trans)
        df_y_test = pd.DataFrame(y_test)
        df_test = pd.concat([df_y_test, df_X_test], axis = 1)
        df_b = df_test.T
        df_b = df_b.reset_index(drop = True)
        for i in range(0,idx.size):
            df_b.rename(index ={i: idx[i]},inplace= True)
        Data_test_float = df_b.T
        self.data_test = pd.concat([Data_test_float , df_test_int], axis =
        self.Check_remove_outlier()
        input_point = input("Do you want to use KBin method for this Transformat
        point = input_point.title()
        if point == "Y":
            self.KBin()

    def KBin (self):
        print("Handling with KBin method")
        # Train
        self.data_train_int = self.data_train.select_dtypes('int64')
        self.data_train_good = self.data_train[self.good]
        self.data_train_bad = self.data_train[self.bad]
        self.n_bins = int(input("Please input number of bins"))
        self.encode = input("Please input type of encode")
        self.strategy = input("Please input type of strategy")
        # Use the number of bins entered above instead of a fixed value.
        kst = KBinsDiscretizer(n_bins = self.n_bins, encode = self.encode, strategy =
        kst.fit(self.data_train_bad)
        self.bad_new = pd.DataFrame(kst.transform(self.data_train_bad)).astype

        self.data_train_clean = pd.concat([self.data_train_good,self.bad_new
        self.data_train = self.data_train_clean

        # Test
        self.data_test_int = self.data_test.select_dtypes('int64')
        self.data_test_good = self.data_test[self.good]
        self.data_test_bad = self.data_test[self.bad]
        self.bad_new = pd.DataFrame(kst.transform(self.data_test_bad)).astype
        self.data_test_clean = pd.concat([self.data_test_good,self.bad_new,
        self.data_test = self.data_test_clean

        self.Check_remove_outlier()

    def Activate_Check(self):
        print('remove by IQR without handling')
        self.Check_remove_outlier()
        self.Outlier_Winsor()
        self.Transformation()

In [10]:
df1 = Check_Univariate_outlier(Data_train, Data_test)
df1.Check_remove_outlier()
df1.Activate_Check()
Total data remove on Train 1399
Total data remove on Test 351
Number of good features: 73
Number of bad features with data remove > 0: 673
***************************************************************************
remove by IQR without handling
Total data remove on Train 1399
Total data remove on Test 351
Number of good features: 73
Number of bad features with data remove > 0: 673
***************************************************************************
Handling with Winsorization method
Total data remove on Train 0
Total data remove on Test 0
Number of good features: 746
Number of bad features with data remove > 0: 0
***************************************************************************

Handling with Transformation_Uniformtransformer method


Total data remove on Train 1398
Total data remove on Test 350
Number of good features: 700
Number of bad features with data remove > 0: 46
***************************************************************************

Handling with KBin method

Total data remove on Train 0


Total data remove on Test 0
Number of good features: 700
Number of bad features with data remove > 0: 0
***************************************************************************
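
The Winsorization lines in `Outlier_Winsor` are cut off in this export, so the exact replacement values are not visible. A minimal sketch of one common variant, assuming values beyond the 1.5 × IQR fences are clipped to the fences and that the fences are computed on the training set only (the helper name `iqr_bounds` and the data are made up):

```python
import pandas as pd


def iqr_bounds(train, k=1.5):
    """Return per-column lower and upper fences computed on the training data."""
    q1 = train.quantile(0.25)
    q3 = train.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


train = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 100.0]})
test = pd.DataFrame({"x": [-50.0, 2.5, 60.0]})

low, high = iqr_bounds(train)
# Clip both sets with the training fences so no test information leaks into the bounds.
train_w = train.clip(lower=low, upper=high, axis=1)
test_w = test.clip(lower=low, upper=high, axis=1)
print(train_w["x"].tolist(), test_w["x"].tolist())
```

Clipping keeps every row, which is consistent with the log above reporting zero removed rows after the Winsorization step.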

In [11]:
Data_train = df1.data_train
Data_test = df1.data_test

In [13]:
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM
from sklearn.covariance import EllipticEnvelope
from sklearn.neighbors import LocalOutlierFactor

LOF = LocalOutlierFactor(n_neighbors = 20)


robust_cov = EllipticEnvelope(contamination=0.1)
emp_cov = EllipticEnvelope(contamination=0.1, support_fraction =1)
o_SVM = OneClassSVM()
iso_forest = IsolationForest(n_estimators=100, contamination=0.10)
class Mutivariate():

    def __init__(self, data_train, data_test):
        self.data_train_0 = data_train
        self.data_test_0 = data_test

    def LOF(self):
        self.data_train_LOF = self.data_train_0.copy()
        self.data_test_LOF = self.data_test_0.copy()
        while True:
            try:
                self.n_neighbors = int(input("Please input number of neighbors f
                break
            except ValueError:
                print("Error values!")
        LOF = LocalOutlierFactor(n_neighbors = self.n_neighbors)
        LOF.fit(self.data_train_LOF)
        self.Outlier_LOF = self.data_train_LOF[LOF.fit_predict(self.data_train_L
        self.Data_train_LOF = self.data_train_LOF[LOF.fit_predict(self.data_trai
        print(f"Total outlier remove by LOF:", self.Outlier_LOF.shape[0])
        # Test
        LOF = LocalOutlierFactor(n_neighbors = self.n_neighbors, novelty =
        LOF.fit(self.data_train_LOF)
        self.Data_test_LOF = self.data_test_LOF[LOF.predict(self.data_test_LOF

    def Ist_for(self):
        self.data_train_Ist_for = self.data_train_0.copy()
        self.data_test_Ist_for = self.data_test_0.copy()
        while True:
            try:
                self.n_estimators = int(input("Please input number of estimators
                self.contamination = float(input("Please input number of contami
                break
            except ValueError:
                print("Error values!")
        Iso_for = IsolationForest(n_estimators=self.n_estimators, contamination
        Iso_for.fit(self.data_train_Ist_for)
        self.Outlier_iso = self.data_train_Ist_for[Iso_for.predict(self.data_tra
        self.Data_train_iso = self.data_train_Ist_for[Iso_for.predict(self.
        self.Data_test_iso = self.data_test_Ist_for[Iso_for.predict(self.data_te
        print(f"Total outlier remove by Isolation forest:", self.Outlier_iso

    def o_SVM(self):
        self.data_train_o_SVM = self.data_train_0.copy()
        self.data_test_o_SVM = self.data_test_0.copy()
        o_SVM = OneClassSVM()
        o_SVM.fit(self.data_train_o_SVM)
        self.Outlier_osvm = self.data_train_o_SVM[o_SVM.predict(self.data_train_
        self.Data_train_osvm = self.data_train_o_SVM[o_SVM.predict(self.data_tra
        self.Data_test_osvm = self.data_test_o_SVM[o_SVM.predict(self.data_test_
        print(f"Total outlier remove by One Class SVM:", self.Outlier_osvm.

    def robust_cov(self):
        self.data_train_r_cov = self.data_train_0.copy()
        self.data_test_r_cov = self.data_test_0.copy()
        while True:
            try:
                self.contamination = float(input("Please input number of contami
                break
            except ValueError:
                print("Error values!")
        robust_cov = EllipticEnvelope(contamination= self.contamination)
        robust_cov.fit(self.data_train_r_cov)
        self.Outlier_rcov = self.data_train_r_cov[robust_cov.predict(self.data_t
        self.Data_train_rcov = self.data_train_r_cov[robust_cov.predict(self
        self.Data_test_rcov = self.data_test_r_cov[robust_cov.predict(self.
        print(f"Total outlier remove by Robust covariance:", self.Outlier_rcov

    def emp_cov(self):
        self.data_train_e_cov = self.data_train_0.copy()
        self.data_test_e_cov = self.data_test_0.copy()
        while True:
            try:
                self.contamination = float(input("Please input number of contami
                self.support_fraction = float(input("Please input number of supp
                break
            except ValueError:
                print("Error values!")
        emp_cov = EllipticEnvelope(contamination= self.contamination, support_fr
        emp_cov.fit(self.data_train_e_cov)
        self.Outlier_ecov = self.data_train_e_cov[emp_cov.predict(self.data_trai
        self.Data_train_ecov = self.data_train_e_cov[emp_cov.predict(self.data_t
        self.Data_test_ecov = self.data_test_e_cov[emp_cov.predict(self.data_tes
        print(f"Total outlier remove by Emperical covariance:", self.Outlier_eco

    def Visualize_Outlier(self):
        self.LOF()
        self.Ist_for()
        self.o_SVM()
        self.robust_cov()
        self.emp_cov()
        Models = [('Local Outlier Factor', self.Outlier_LOF.shape[0]), ('Isolat
        ('One Class SVM', self.Outlier_osvm.shape[0]),('Robust covaria
        for name, N_out in Models:
            plt.rcParams["figure.figsize"] = (20,8)
            plt.bar(name,N_out)

    def Mutivariate_Outlier_Handling(self):
        while True:
            try:
                algo = input("Please select algorithm for multivariate method:
                # input() returns a string; convert it so the comparisons below can match.
                algo = int(algo)
                break
            except ValueError:
                print("Wrong! Please input number from 1-5.")
        if algo == 1:
            self.LOF()
        elif algo == 2:
            self.Ist_for()
        elif algo == 3:
            self.o_SVM()
        elif algo == 4:
            self.robust_cov()
        elif algo == 5:
            self.emp_cov()
        else:
            self.Mutivariate_Outlier_Handling()

In [ ]:
df4= Mutivariate(Data_train, Data_test)
df4.Visualize_Outlier()

In [14]:
df4= Mutivariate(Data_train, Data_test)
df4.LOF()

Total outlier remove by LOF: 27

In [15]:
Data_train = df4.Data_train_LOF
Data_test = df4.Data_test_LOF
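
`LocalOutlierFactor` has two modes: `fit_predict` labels the data it was fitted on, while `novelty=True` is needed to call `predict` on unseen data such as the test set. A minimal sketch on synthetic points, assuming `n_neighbors=20` (the value actually entered at the prompt is not recorded); as a design choice the novelty model here is fitted on the cleaned training set:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# Synthetic training cloud around the origin plus one obvious outlier at (8, 8).
X_train = np.vstack([rng.normal(size=(50, 2)), [[8.0, 8.0]]])
X_test = np.array([[0.1, -0.2], [9.0, 9.0]])

# Training set: fit_predict returns 1 for inliers and -1 for outliers.
train_mask = LocalOutlierFactor(n_neighbors=20).fit_predict(X_train) == 1
X_train_clean = X_train[train_mask]

# Test set: a separate model with novelty=True can score points it was not fitted on.
novelty_lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X_train_clean)
test_mask = novelty_lof.predict(X_test) == 1
X_test_clean = X_test[test_mask]
print(X_train_clean.shape, X_test_clean.shape)
```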

Module 4: Rescale
In [16]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
class rescale(Mutivariate):

    def __init__(self, data_train, data_test):
        self.data_train_0 = data_train
        self.data_test_0 = data_test
        self.scl1 = MinMaxScaler()
        self.scl2 = StandardScaler()
        self.scl3 = RobustScaler()

    def rescale_fit(self):
        self.data_train = self.data_train_0.copy()
        self.data_test = self.data_test_0.copy()
        while True:
            try:
                self.transformer = int(input("Please select type of transformati
                break
            except ValueError:
                print("Error value")
        if self.transformer == 1:
            self.scl =self.scl1
        elif self.transformer == 2:
            self.scl =self.scl2
        else:
            self.scl =self.scl3

        # Train: integer columns are set aside, float features are scaled
        df_train_int = self.data_train.select_dtypes("int64")
        df_train_int = df_train_int.reset_index(drop = True)

        y_train = self.data_train.select_dtypes("float64").iloc[:,0].values
        X_train = self.data_train.select_dtypes("float64").iloc[:,1:].values

        self.scl.fit(X_train)
        X_train_trans = self.scl.transform(X_train)
        idx = self.data_train.select_dtypes("float64").T.index

        df_X_train = pd.DataFrame(X_train_trans)
        df_y_train = pd.DataFrame(y_train)
        df_train = pd.concat([df_y_train, df_X_train], axis = 1)
        df_a = df_train.T
        df_a = df_a.reset_index(drop = True)
        for i in range(0,idx.size):
            df_a.rename(index ={i: idx[i]},inplace= True)
        Data_train_float = df_a.T
        self.Data_train = pd.concat([Data_train_float , df_train_int], axis

        # Test: reuse the scaler fitted on the training set
        df_test_int = self.data_test.select_dtypes("int64")
        df_test_int = df_test_int.reset_index(drop = True)

        y_test = self.data_test.select_dtypes("float64").iloc[:,0].values
        X_test = self.data_test.select_dtypes("float64").iloc[:,1:].values

        X_test_trans = self.scl.transform(X_test)
        idx = self.data_test.select_dtypes("float64").T.index

        df_X_test = pd.DataFrame(X_test_trans)
        df_y_test = pd.DataFrame(y_test)
        df_test = pd.concat([df_y_test, df_X_test], axis = 1)
        df_b = df_test.T
        df_b = df_b.reset_index(drop = True)
        for i in range(0,idx.size):
            df_b.rename(index ={i: idx[i]},inplace= True)
        Data_test_float = df_b.T
        self.Data_test = pd.concat([Data_test_float , df_test_int], axis =

In [17]:
df5 = rescale(Data_train, Data_test)
df5.rescale_fit()

In [18]:
df5.Data_train.head(2)

Out[18]:
   pChEMBL Value  ALogP      ALogp2    AMR        naAromAtom  nAromBond  nAtom      nHeavyAtom
0  8.4            -1.192863  1.154339  1.061050   -0.899420   -0.114518  0.770779   0.863838
1  6.3            -1.641491  1.635178  -1.545092  -0.074438   -0.114518  -1.227154  -1.383974

2 rows × 1085 columns

In [19]:
Data_train = df5.Data_train
Data_test = df5.Data_test
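
The rescaling step fits the scaler on the training features only and applies it to both sets, leaving the target and the integer columns untouched. A minimal sketch of that idea that writes the scaled values back by column label, assuming a hypothetical helper `scale_float_features` and made-up data (the original class rebuilds the frame through transposes instead):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def scale_float_features(train, test, target, scaler=None):
    """Scale float64 feature columns; fit on train only (hypothetical helper)."""
    if scaler is None:
        scaler = StandardScaler()
    cols = [c for c in train.select_dtypes("float64").columns if c != target]
    train_s, test_s = train.copy(), test.copy()
    train_s[cols] = scaler.fit_transform(train[cols])
    test_s[cols] = scaler.transform(test[cols])
    return train_s, test_s


# Made-up data: "y" is the target, "f" a float feature, "n" an integer feature.
train = pd.DataFrame({"y": [1.0, 2.0, 3.0], "f": [0.0, 1.0, 2.0], "n": [1, 0, 1]})
test = pd.DataFrame({"y": [1.5, 2.5], "f": [1.0, 3.0], "n": [0, 1]})
train_s, test_s = scale_float_features(train, test, target="y")
print(train_s)
```

Any scaler with `fit_transform` and `transform` (for example `MinMaxScaler` or `RobustScaler`, the other two options in the class) can be passed through the `scaler` argument.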

In [20]:
X_train = Data_train.iloc[:,1:].values
y_train = Data_train.iloc[:,0].values
X_test = Data_test.iloc[:,1:].values
y_test = Data_test.iloc[:,0].values

Module 5: Feature selection


In [27]:
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.linear_model import LinearRegression, LassoCV, ElasticNetCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor
from catboost import CatBoostRegressor
from sklearn.model_selection import KFold, cross_val_score, RepeatedKFold
from sklearn.svm import LinearSVR

In [28]:
import matplotlib.pyplot as plt
class feature_selection:

    def __init__(self, data_train, data_test):
        self.X_train = data_train.iloc[:,1:].values
        self.y_train = data_train.iloc[:,0].values
        self.X_test = data_test.iloc[:,1:].values
        self.y_test = data_test.iloc[:,0].values
        self.result = list()
        self.name = list()

    def random_forest(self):
        forest = RandomForestRegressor(random_state=42)
        forest.fit(self.X_train, self.y_train)
        model_RF = SelectFromModel(forest, prefit=True)
        self.X_train_new = model_RF.transform(self.X_train)
        self.X_test_new = model_RF.transform(self.X_test)
        self.name.append("Random Forest")
        self.check_intenal_performance()

    def extra_tree(self):
        ext_tree = ExtraTreesRegressor(random_state=42)
        ext_tree.fit(self.X_train, self.y_train)
        model_ext_tree = SelectFromModel(ext_tree, prefit=True)
        self.X_train_new = model_ext_tree.transform(self.X_train)
        self.X_test_new = model_ext_tree.transform(self.X_test)
        self.name.append("ExtraTree")
        self.check_intenal_performance()

    def ada(self):
        ada = AdaBoostRegressor(random_state=42)
        ada.fit(self.X_train, self.y_train)
        model_ada = SelectFromModel(ada, prefit=True)
        self.X_train_new = model_ada.transform(self.X_train)
        self.X_test_new = model_ada.transform(self.X_test)
        self.name.append("AdaBoost")
        self.check_intenal_performance()

    def grad(self):
        grad = GradientBoostingRegressor(random_state=42)
        grad.fit(self.X_train, self.y_train)
        model_grad = SelectFromModel(grad, prefit=True)
        self.X_train_new = model_grad.transform(self.X_train)
        self.X_test_new = model_grad.transform(self.X_test)
        self.name.append("GradientBoost")
        self.check_intenal_performance()

    def XGb(self):
        XGb = XGBRegressor(random_state=42)
        XGb.fit(self.X_train, self.y_train)
        model_XGb = SelectFromModel(XGb, prefit=True)
        self.X_train_new = model_XGb.transform(self.X_train)
        self.X_test_new = model_XGb.transform(self.X_test)
        self.name.append("XGBoost")
        self.check_intenal_performance()

    def Lasso(self):
        lasso = LassoCV(random_state = 42)
        lasso.fit(self.X_train, self.y_train)
        model_lasso = SelectFromModel(lasso, prefit=True)
        self.X_train_new = model_lasso.transform(self.X_train)
        self.X_test_new = model_lasso.transform(self.X_test)
        self.name.append("lasso")
        self.check_intenal_performance()

    def ELN(self):
        ELN = ElasticNetCV(random_state = 42)
        ELN.fit(self.X_train, self.y_train)
        model_ELN = SelectFromModel(ELN, prefit=True)
        self.X_train_new = model_ELN.transform(self.X_train)
        self.X_test_new = model_ELN.transform(self.X_test)
        self.name.append("ElasticNet")
        self.check_intenal_performance()

    def feature_importance(self):
        model = RandomForestRegressor(random_state = 42)
        model.fit(self.X_train, self.y_train)
        importance = model.feature_importances_
        while True:
            threshold = float(input("Select features importances threshold"
            print("The remain features = ", (importance > threshold).sum())
            action = input("Do you want to check another threshold?(Y/N)")
            if action.title() == 'N':
                break
        self.X_train_new=self.X_train[:,importance > threshold]
        self.X_test_new= self.X_test[:,importance > threshold]
        self.name.append("Feature Importance")
        self.check_intenal_performance()

    def check_performance(self):
        forest_model = RandomForestRegressor(random_state=42)
        forest_model.fit(self.X_train_new, self.y_train)

        # Score against the test target stored on the instance, not a notebook global.
        self.r2= r2_score(self.y_test, forest_model.predict(self.X_test_new))
        self.MSE = mean_squared_error(self.y_test, forest_model.predict(self.X_test_n
        self.RMSE = np.sqrt(self.MSE)
        self.MAE = mean_absolute_error(self.y_test, forest_model.predict(self.X_test_

        print("R2 = ", self.r2)
        print("MSE = ", self.MSE)
        print("RMSE = ", self.RMSE)
        print("MAE = ", self.MAE)

    def check_intenal_performance(self):
        cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)
        in_model = RandomForestRegressor(random_state=42)
        score_internal = cross_val_score(in_model, self.X_train_new, self.y_trai
        print(score_internal.mean())
        self.result.append(score_internal)

    def model_feature_selection(self):
        while True:
            try:
                models = int(input("Please select algorithm for feature selectio
                break
            except ValueError:
                print("\nWrong values! Input number from 1-6!")
        if models == 1:
            self.random_forest()
        elif models == 2:
            self.extra_tree()
        elif models ==3:
            self.ada()
        elif models == 4:
            self.grad()
        elif models == 5:
            self.XGb()
        elif models == 6:
            self.feature_importance()
        else:
            self.model_feature_selection()

    def compare_model(self):
        fig =plt.figure(figsize = (20,8))
        self.result = list()
        self.name = list()
        self.random_forest()
        self.extra_tree()
        self.ada()
        self.grad()
        self.XGb()
        self.Lasso()
        self.ELN()
        self.feature_importance()

        plt.boxplot(self.result, labels=self.name, showmeans=True)
        plt.show()
        fig.savefig("Compare feature selection method.png", transparent = True

In [29]:
Descriptor_select = feature_selection(Data_train, Data_test)
Descriptor_select.compare_model()

-0.7358630025248843
-0.7281728504218296
-0.745356557436394
-0.7261001611653198
-0.7343649689156617
-0.7552536490535624
-0.7507194810054509

The remain features = 180

-0.7360665483457994
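
Each number above is the mean cross-validation score of a random forest trained on the features kept by one selection method. A minimal sketch of that loop for a single method on synthetic data; the `scoring` argument is cut off in this export, so `neg_mean_absolute_error` (the metric used in the RFE cell) is an assumption, as are the data and the reduced `n_estimators`:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import RepeatedKFold, cross_val_score

rng = np.random.default_rng(0)
# Synthetic data: only the first two of ten features drive the target.
X = rng.normal(size=(80, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=80)

# Select the features whose importance is above the mean importance (the default threshold).
selector = SelectFromModel(ExtraTreesRegressor(n_estimators=50, random_state=42))
X_sel = selector.fit_transform(X, y)

# Score a random forest on the reduced matrix with 5 folds repeated 3 times (15 scores).
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)
scores = cross_val_score(
    RandomForestRegressor(n_estimators=50, random_state=42),
    X_sel, y, scoring="neg_mean_absolute_error", cv=cv,
)
print(X_sel.shape[1], scores.mean())
```

With a `neg_*` scorer the values are negative by construction, so a score closer to zero is better.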
In [31]:
# Use Anova test to choose feature selection method
d = pd.DataFrame(Descriptor_select.result)
idx = Descriptor_select.name
for i in range(0,len(idx)):
    d.rename(index ={i: idx[i]},inplace= True)
check_result = d.T

import scipy.stats as stats

# stats f_oneway functions takes the groups as input and returns ANOVA F and p v
Ftest = stats.f_oneway(check_result['Random Forest'], check_result['ExtraTree'
print(f"FTest pvalue = {Ftest[1]}")

Ttest = stats.ttest_ind(check_result['Random Forest'], check_result['ExtraTree'
print(f"TTest pvalue = {Ttest[1]}")

FTest pvalue = 0.8552435952594457


TTest pvalue = 0.6130101128631499
Both tests give a p-value > 0.05, so the differences between the methods are not statistically significant.

Three of the methods:

Extra Tree
XGBoost
ElasticNet CV

give results that do not vary much across the 15 folds. Extra Tree gives the best result, so it is selected.
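The comparison above can be sketched in a self-contained form. This is a minimal illustration, assuming each method contributes 15 cross-validation scores (5 folds x 3 repeats); the score arrays below are hypothetical random draws, not the notebook's results, and the group names are only labels.

```python
import numpy as np
from scipy import stats

# Hypothetical CV scores, 15 per method (NOT the notebook's actual values).
rng = np.random.default_rng(42)
scores = {
    "Random Forest": rng.normal(-0.736, 0.03, size=15),
    "ExtraTree": rng.normal(-0.728, 0.03, size=15),
    "XGBoost": rng.normal(-0.734, 0.03, size=15),
}

# One-way ANOVA: the null hypothesis is that all groups share the same mean.
f_stat, p_anova = stats.f_oneway(*scores.values())

# Two-sample t-test between two of the groups.
t_stat, p_ttest = stats.ttest_ind(scores["Random Forest"], scores["ExtraTree"])

alpha = 0.05
print(f"ANOVA p-value = {p_anova:.3f}, significant: {p_anova < alpha}")
print(f"t-test p-value = {p_ttest:.3f}, significant: {p_ttest < alpha}")
```

Because every method is scored on the same seeded splits, a paired test such as `scipy.stats.ttest_rel` may be a closer match to the data than the independent-samples test; which one is appropriate is a judgment call.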

In [35]:
Descriptor_select = feature_selection(Data_train, Data_test)
Descriptor_select.extra_tree()

-0.7281728504218296
RFE METHOD
# evaluate RFE for regression
from numpy import mean
from numpy import std
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.feature_selection import RFECV
from sklearn.pipeline import Pipeline
# create pipeline
rfe = RFECV(estimator=RandomForestRegressor(random_state=42))
model = RandomForestRegressor(random_state=42)
pipeline = Pipeline(steps=[('s',rfe),('m',model)])
# evaluate model
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)
n_scores = cross_val_score(pipeline, X_train, y_train, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
# report performance
print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Module 6: Model Prepare


In [36]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, ElasticNetCV, Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor
from catboost import CatBoostRegressor
from sklearn.svm import SVR
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.model_selection import KFold
from sklearn.cross_decomposition import PLSRegression

In [37]:
X_train = Descriptor_select.X_train_new
X_test = Descriptor_select.X_test_new
y_train = Descriptor_select.y_train
y_test = Descriptor_select.y_test

1. Auto Model
In [ ]:
from Auto_ML.Auto_ML_HHC import LabHHCRegressor
reg = LabHHCRegressor(verbose=0, ignore_warnings=False, custom_metric=None)
models, predictions = reg.fit(X_train, X_test, y_train, y_test)
models

In [114…
models

Out[114… Model   Adjusted R-Squared   R-Squared   RMSE   MAE   MAPE   Time Taken

CatBoostRegressor 0.62 0.76 0.62 0.47 0.08 11.57

ExtraTreesRegressor 0.62 0.75 0.63 0.46 0.08 2.43

HistGradientBoostingRegressor 0.61 0.75 0.63 0.47 0.08 1.09

LGBMRegressor 0.61 0.75 0.64 0.48 0.08 0.57

SVR 0.59 0.73 0.65 0.49 0.08 0.25

RandomForestRegressor 0.58 0.73 0.66 0.49 0.08 4.41

KNeighborsRegressor 0.58 0.73 0.66 0.49 0.08 0.04

NuSVR 0.58 0.73 0.66 0.50 0.08 0.23

MLPRegressor 0.54 0.71 0.69 0.52 0.09 1.37

GradientBoostingRegressor 0.53 0.70 0.70 0.54 0.09 2.12

XGBRegressor 0.53 0.70 0.70 0.53 0.09 0.72

BaggingRegressor 0.49 0.67 0.73 0.54 0.09 0.48

SGDRegressor 0.36 0.59 0.81 0.64 0.11 0.03


RidgeCV 0.36 0.59 0.81 0.65 0.11 0.03

Ridge 0.36 0.59 0.82 0.65 0.11 0.02

HuberRegressor 0.35 0.58 0.82 0.63 0.11 0.12

BayesianRidge 0.35 0.58 0.82 0.66 0.11 0.04

AdaBoostRegressor 0.35 0.58 0.82 0.68 0.11 0.74

LinearRegression 0.35 0.58 0.82 0.65 0.11 0.02

TransformedTargetRegressor 0.35 0.58 0.82 0.65 0.11 0.02

PoissonRegressor 0.34 0.58 0.82 0.66 0.11 0.02

ElasticNetCV 0.34 0.57 0.83 0.66 0.11 0.43

LassoCV 0.34 0.57 0.83 0.66 0.11 0.54

LassoLarsCV 0.33 0.57 0.83 0.66 0.11 0.21

LinearSVR 0.31 0.56 0.84 0.65 0.11 0.23

LassoLarsIC 0.30 0.55 0.85 0.67 0.11 0.06

ExtraTreeRegressor 0.30 0.55 0.85 0.61 0.10 0.05

GammaRegressor 0.28 0.54 0.86 0.69 0.12 0.02

TweedieRegressor 0.28 0.54 0.86 0.70 0.12 0.02

DecisionTreeRegressor 0.26 0.53 0.87 0.61 0.10 0.11

LarsCV 0.21 0.50 0.90 0.73 0.12 0.16

OrthogonalMatchingPursuitCV 0.20 0.48 0.91 0.73 0.12 0.03

OrthogonalMatchingPursuit 0.20 0.48 0.91 0.73 0.12 0.02

PLSRegression 0.16 0.46 0.93 0.76 0.13 0.02

PassiveAggressiveRegressor -0.26 0.19 1.14 0.93 0.16 0.02

ElasticNet -0.27 0.19 1.14 0.99 0.16 0.02

LassoLars -0.57 -0.01 1.28 1.10 0.18 0.02

DummyRegressor -0.57 -0.01 1.28 1.10 0.18 0.02

Lasso -0.57 -0.01 1.28 1.10 0.18 0.02

RANSACRegressor -1.99 -0.92 1.76 1.33 0.22 0.38

Lars -6.56 -3.85 2.80 1.87 0.31 0.07

GaussianProcessRegressor -32.12 -20.25 5.85 5.38 0.86 0.58

KernelRidge -36.54 -23.09 6.23 6.18 1.02 0.10

2. Model from scratch


In [111…
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error, median_absolute_error, mean_absolute_percentage_error
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, NuSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier,
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB, BernoulliNB
class Regression_report:
def __init__(self, X_train, X_test, y_train, y_test, metric = None, create_d
self.X_train = X_train
self.X_test = X_test
self.y_train = y_train
self.y_test = y_test
self.metric = metric
self.df_compare_train = pd.DataFrame(columns =["R-squared", "Adjusted R-
self.df_compare_test = self.df_compare_train.copy()
self.create_df = create_df
self.lr = LinearRegression()
self.Rg = Ridge(alpha = 1, random_state = 42)
self.eln = ElasticNetCV(cv = KFold(5), random_state = 42)
self.PLS = PLSRegression(20)
self.knn = KNeighborsRegressor()
self.svr = SVR(kernel='rbf', gamma='scale', coef0=0.0, tol=0.001, C
self.dt = DecisionTreeRegressor(random_state=42)
self.rf = RandomForestRegressor(random_state=42)
self.ada = AdaBoostRegressor(random_state=42)
self.gbr = GradientBoostingRegressor(random_state=42)
self.xgb = XGBRegressor(random_state=42)
self.cb = CatBoostRegressor(random_state=42, verbose = 0)

def model(self):
self.regressors = [('Linear Regression', self.lr),('Ridge Regression'
('Decision Tree', self.dt), ('Random Forest', self.rf), ('AdaBoos
('Gradient Boosting Regressor', self.gbr), ('XGBoost', self.xgb

for self.name, self.estimator in self.regressors:


self.estimator.fit(self.X_train, self.y_train)
self.Report_metrics()

    @staticmethod
    def adjusted_rsquared(r2, n, p):
        # Adjusted R-squared for n observations and p predictors
        return 1 - (1-r2) * ((n-1) / (n-p-1))

def Report_metrics(self):

self.P_train =self.estimator.predict(self.X_train)
self.P_test =self.estimator.predict(self.X_test)

if self.create_df==True:
r2_train = r2_score(self.y_train,self.P_train)
r2_test = r2_score(self.y_test,self.P_test)
r_squared_train = (1 - (1-r2_train) * ((self.X_train.shape[0]-1
r_squared_test = (1 - (1-r2_test) * ((self.X_test.shape[0]-1) /

ind =["R-squared", "Adjusted R-squared","RMSE","MAE","MedAE","MAPE"


self.metrics_df =pd.DataFrame(index=ind,data=[[r2_score (self
[r_squared_train,
[mean_squared_error
[mean_absolute_error
[median_absolute_error
[mean_absolute_percenta
[len(self.y_train)

#train
self.metrics_df["Estimator Name"]= self.name
df_compared_train = self.metrics_df.drop(['Test', "Estimator Name"
df_compared_train = df_compared_train.rename(columns ={'Train':
self.df_compare_train = self.df_compare_train.append(df_compared_tra

#test
self.metrics_df["Estimator Name"]= self.name
df_compared_test = self.metrics_df.drop(['Train', "Estimator Name"
df_compared_test = df_compared_test.rename(columns ={'Test': self
self.df_compare_test = self.df_compare_test.append(df_compared_test

else:
self.metrics_df ="File not created"

def Visualize_report(self):
while True:
try:
metric = int(input("Which metric do you want to visualize?\n\t
break
except:
print("Wrong metric! Please input number!")
for self.name, self.regressor in self.regressors:

# Fit regressor to the training set


self.regressor.fit(self.X_train, self.y_train)

# Predict on the instance's own test set (not the global X_test)
y_pred = self.regressor.predict(self.X_test)

# Evaluate performance on the test set


if metric == 1:
name = 'R_squared'
result = round(r2_score(self.y_test,y_pred),3)*100
if metric == 2:
name = 'Root mean squared error'
result = round(np.sqrt(mean_squared_error(self.y_test,y_pred)),3)
if metric == 3:
name = 'Mean absolute error'
result = round(mean_absolute_error(self.y_test,y_pred),3)
if metric == 4:
name = 'Median absolute error'
result = round(median_absolute_error(self.y_test,y_pred),3)
if metric == 5:
name = 'Mean absolute percentage error'
result = round(mean_absolute_percentage_error(self.y_test,y_pred

plt.rcParams["figure.figsize"] = (36,15)
ax = plt.bar(self.name,result)
plt.ylabel(name)
plt.xlabel("Algorithm")
plt.title(f"{name} compare", size = 20)

for p in ax.patches:
x = p.get_x()+ (p.get_width()/3)
y = p.get_height()+0.05
plt.text(x, y, round(result,3), fontsize=15)
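
The report class above builds one metrics table per estimator and stacks them with `DataFrame.append`, which was removed in pandas 2.0. The following is a minimal sketch of the same idea using `pd.concat`. The helper names `adjusted_r2` and `metrics_row`, the synthetic data, and the two Ridge settings are assumptions made for illustration; they are not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             median_absolute_error, r2_score)
from sklearn.model_selection import train_test_split


def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p predictors."""
    return 1 - (1 - r2) * ((n - 1) / (n - p - 1))


def metrics_row(name, y_true, y_pred, n_features):
    """Return a one-row DataFrame of regression metrics indexed by name."""
    r2 = r2_score(y_true, y_pred)
    row = {
        "R-squared": r2,
        "Adjusted R-squared": adjusted_r2(r2, len(y_true), n_features),
        "RMSE": np.sqrt(mean_squared_error(y_true, y_pred)),
        "MAE": mean_absolute_error(y_true, y_pred),
        "MedAE": median_absolute_error(y_true, y_pred),
        "No. of obs.": len(y_true),
    }
    return pd.DataFrame(row, index=[name])


# Synthetic stand-in data (NOT the notebook's descriptor set).
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

rows = []
for name, est in [("Ridge alpha=1", Ridge(alpha=1.0)),
                  ("Ridge alpha=100", Ridge(alpha=100.0))]:
    est.fit(X_train, y_train)
    rows.append(metrics_row(name, y_test, est.predict(X_test), X_test.shape[1]))

# Collect rows in a list and concatenate once instead of appending in a loop.
df_compare_test = pd.concat(rows)
print(df_compare_test.round(2))
```

Collecting rows in a list and concatenating once also avoids copying the growing frame on every iteration.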

In [112…
auto_models = Regression_report(X_train, X_test, y_train, y_test, create_df
auto_models.model()

In [113…
auto_models.Visualize_report()

In [115…
auto_models.df_compare_test

Out[115…
                             R-squared  Adjusted R-squared  RMSE  MAE   MedAE  MAPE  No. of obs.
Linear Regression            0.58       0.35                0.82  0.65  0.54   0.11  347.00
Ridge Regression             0.59       0.36                0.81  0.65  0.54   0.11  347.00
Elastic Net                  0.58       0.34                0.83  0.66  0.53   0.11  347.00
Partial Least Squares        0.59       0.36                0.81  0.64  0.52   0.11  347.00
K Nearest Neighbours         0.68       0.50                0.72  0.52  0.36   0.09  347.00
Support vector machine       0.71       0.56                0.68  0.50  0.39   0.09  347.00
Decision Tree                0.53       0.26                0.87  0.61  0.44   0.10  347.00
Random Forest                0.73       0.58                0.66  0.49  0.38   0.08  347.00
AdaBoost                     0.58       0.35                0.82  0.68  0.61   0.11  347.00
Gradient Boosting Regressor  0.70       0.53                0.70  0.54  0.44   0.09  347.00
XGBoost                      0.70       0.53                0.70  0.53  0.42   0.09  347.00
catboost                     0.75       0.62                0.63  0.47  0.37   0.08  347.00


In [116…
auto_models.df_compare_train

Out[116…
                             R-squared  Adjusted R-squared  RMSE  MAE   MedAE  MAPE  No. of obs.
Linear Regression            0.69       0.65                0.76  0.59  0.50   0.10  1376.00
Ridge Regression             0.69       0.65                0.76  0.59  0.50   0.10  1376.00
Elastic Net                  0.67       0.64                0.78  0.61  0.52   0.11  1376.00
Partial Least Squares        0.68       0.65                0.76  0.59  0.50   0.10  1376.00
K Nearest Neighbours         0.81       0.79                0.59  0.42  0.31   0.07  1376.00
Support vector machine       0.84       0.83                0.53  0.35  0.18   0.06  1376.00
Decision Tree                0.97       0.97                0.23  0.05  0.00   0.01  1376.00
Random Forest                0.94       0.93                0.33  0.22  0.15   0.04  1376.00
AdaBoost                     0.67       0.63                0.78  0.66  0.65   0.11  1376.00
Gradient Boosting Regressor  0.84       0.83                0.53  0.41  0.33   0.07  1376.00
XGBoost                      0.97       0.97                0.23  0.07  0.02   0.01  1376.00
catboost                     0.96       0.95                0.28  0.17  0.11   0.03  1376.00

Tuning SVM
In [117…
models = SVR()

In [141…
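from sklearn.model_selection import GridSearchCV, RepeatedKFold
# Keyword arguments: recent scikit-learn versions reject positional ones here
cv = RepeatedKFold(n_splits=5, n_repeats=3)
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01, 0.001], 'kernel': ['rbf', 'poly', 'sigmoid']}
grid = GridSearchCV(models, param_grid, refit=True, verbose=1, cv=cv)
grid.fit(X_train, y_train)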

Fitting 15 folds for each of 48 candidates, totalling 720 fits


Out[141… GridSearchCV(cv=RepeatedKFold(n_repeats=3, n_splits=5, random_state=None),
             estimator=SVR(),
             param_grid={'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01, 0.001],
                         'kernel': ['rbf', 'poly', 'sigmoid']},
             verbose=1)

In [143…
print(grid.best_estimator_.C)
print(grid.best_estimator_.gamma)
print(grid.best_estimator_.kernel)

1
0.01
rbf
In [144…
#Before
svr = SVR()
svr.fit(X_train, y_train)
RMSE = mean_squared_error(y_test, svr.predict(X_test), squared = False)
MAPE = mean_absolute_percentage_error(y_test, svr.predict(X_test))
MAPE*100

Out[144… 8.547101079121475

In [145…
# After tuning
model_tuning = SVR(kernel = 'rbf', gamma = 0.01, C = 1)
model_tuning.fit(X_train, y_train)
RMSE = mean_squared_error(y_test, model_tuning.predict(X_test), squared = False)
MAPE = mean_absolute_percentage_error(y_test, model_tuning.predict(X_test))
MAPE*100

Out[145… 8.449803139071971
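
The tuning step above moved the test MAPE from about 8.55% to about 8.45%. The sketch below shows one way to make such a search reproducible and to state the selection metric explicitly. Without a `scoring` argument, `GridSearchCV` ranks SVR candidates by the estimator's default score (R-squared). The synthetic data, the reduced grid, and the choice of MAE as the metric are assumptions for illustration only.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, RepeatedKFold, train_test_split
from sklearn.svm import SVR

# Synthetic stand-in data (NOT the notebook's descriptor set).
X, y = make_regression(n_samples=150, n_features=8, noise=5.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# A seeded splitter makes repeated runs select the same folds.
cv = RepeatedKFold(n_splits=5, n_repeats=2, random_state=42)

# Reduced, hypothetical grid to keep the example fast.
param_grid = {"C": [1, 10, 100], "gamma": ["scale", 0.01], "kernel": ["rbf"]}

grid = GridSearchCV(SVR(), param_grid,
                    scoring="neg_mean_absolute_error", cv=cv, refit=True)
grid.fit(X_train, y_train)

# Score the untuned and tuned models on the same held-out set.
baseline = SVR().fit(X_train, y_train)
mae_before = mean_absolute_error(y_test, baseline.predict(X_test))
mae_after = mean_absolute_error(y_test, grid.predict(X_test))

print("Best parameters:", grid.best_params_)
print(f"MAE before tuning: {mae_before:.3f}")
print(f"MAE after tuning:  {mae_after:.3f}")
```

With `refit=True`, `grid.predict` uses the best estimator retrained on the full training set, so no separate model needs to be built from the printed parameters.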

In [ ]:
