Advanced ML PDF

This document discusses loading and preprocessing an IPL dataset for regression analysis. It loads the dataset, encodes categorical features, standardizes features and the target variable, splits the data into train and test sets, builds linear regression and regularization models, and analyzes their performance. It also discusses dealing with imbalanced classification datasets by upsampling the minority class.

Uploaded by

sushanth

Advanced Machine Learning

6.3 Advanced Regression Models

6.4.1.1 Loading IPL Dataset

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn

ipl_auction_df = pd.read_csv( 'IPL IMB381IPL2013.csv' )
ipl_auction_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 130 entries, 0 to 129
Data columns (total 26 columns):
Sl.NO. 130 non-null int64
PLAYER NAME 130 non-null object
AGE 130 non-null int64
COUNTRY 130 non-null object
TEAM 130 non-null object
PLAYING ROLE 130 non-null object
T-RUNS 130 non-null int64
T-WKTS 130 non-null int64
ODI-RUNS-S 130 non-null int64
ODI-SR-B 130 non-null float64
ODI-WKTS 130 non-null int64
ODI-SR-BL 130 non-null float64
CAPTAINCY EXP 130 non-null int64
RUNS-S 130 non-null int64
HS 130 non-null int64
AVE 130 non-null float64
SR-B 130 non-null float64
SIXERS 130 non-null int64
RUNS-C 130 non-null int64
WKTS 130 non-null int64
AVE-BL 130 non-null float64
ECON 130 non-null float64
SR-BL 130 non-null float64
AUCTION YEAR 130 non-null int64
BASE PRICE 130 non-null int64
SOLD PRICE 130 non-null int64
dtypes: float64(7), int64(15), object(4)
memory usage: 26.5+ KB

X_features = ['AGE', 'COUNTRY', 'PLAYING ROLE',
              'T-RUNS', 'T-WKTS', 'ODI-RUNS-S', 'ODI-SR-B',
              'ODI-WKTS', 'ODI-SR-BL', 'CAPTAINCY EXP', 'RUNS-S',
              'HS', 'AVE', 'SR-B', 'SIXERS', 'RUNS-C', 'WKTS',
              'AVE-BL', 'ECON', 'SR-BL']

# categorical_features is initialized with the categorical variable names.
categorical_features = ['AGE', 'COUNTRY', 'PLAYING ROLE', 'CAPTAINCY EXP']

# get_dummies() is invoked to return the dummy features.
ipl_auction_encoded_df = pd.get_dummies( ipl_auction_df[X_features],
                                         columns = categorical_features,
                                         drop_first = True )
ipl_auction_encoded_df.columns
Index(['T-RUNS', 'T-WKTS', 'ODI-RUNS-S', 'ODI-SR-B', 'ODI-WKTS', 'ODI-SR-BL',
       'RUNS-S', 'HS', 'AVE', 'SR-B', 'SIXERS', 'RUNS-C', 'WKTS', 'AVE-BL',
       'ECON', 'SR-BL', 'AGE_2', 'AGE_3', 'COUNTRY_BAN', 'COUNTRY_ENG',
       'COUNTRY_IND', 'COUNTRY_NZ', 'COUNTRY_PAK', 'COUNTRY_SA', 'COUNTRY_SL',
       'COUNTRY_WI', 'COUNTRY_ZIM', 'PLAYING ROLE_Batsman',
       'PLAYING ROLE_Bowler', 'PLAYING ROLE_W. Keeper', 'CAPTAINCY EXP_1'],
      dtype='object')

X = ipl_auction_encoded_df
Y = ipl_auction_df['SOLD PRICE']
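
The effect of drop_first = True is easier to see on a tiny frame. The following is a minimal sketch on made-up data: toy_df and its values are hypothetical and only the column naming mirrors the IPL frame. With drop_first, the first category in sorted order is omitted and acts as the baseline level.

```python
import pandas as pd

# Hypothetical toy data; not taken from the IPL dataset.
toy_df = pd.DataFrame({
    "T-RUNS": [0, 214, 1500],
    "PLAYING ROLE": ["Allrounder", "Bowler", "Batsman"],
})

# drop_first=True drops the first category in sorted order ("Allrounder"),
# which then serves as the implicit reference level.
toy_encoded_df = pd.get_dummies(toy_df, columns=["PLAYING ROLE"], drop_first=True)
encoded_columns = list(toy_encoded_df.columns)
print(encoded_columns)
```

Columns that are not encoded keep their position at the front, which matches the ordering seen in the Index output above.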

6.4.1.2 Standardize X & Y

from sklearn.preprocessing import StandardScaler

## Initializing the StandardScaler

X_scaler = StandardScaler()
## Standardize all the feature columns
X_scaled = X_scaler.fit_transform(X)

## Standardizing Y explicitly by subtracting the mean and
## dividing by the standard deviation
Y = (Y - Y.mean()) / Y.std()
/Users/manaranjan/anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:617: DataConversionWarning: Data with input dtype uint8, int64, float64 were all converted to float64 by StandardScaler.
  return self.partial_fit(X, y)
/Users/manaranjan/anaconda/lib/python3.5/site-packages/sklearn/base.py:462: DataConversionWarning: Data with input dtype uint8, int64, float64 were all converted to float64 by StandardScaler.
  return self.fit(X, **fit_params).transform(X)
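
One detail worth knowing about the two standardization routes above: StandardScaler divides by the population standard deviation (ddof=0), while the pandas Series.std() used for Y defaults to the sample standard deviation (ddof=1). The sketch below uses a made-up four-value array (values is illustrative only) to show that the two results differ by the constant factor sqrt(n / (n - 1)), which is small for 130 rows but not zero.

```python
import numpy as np

# Illustrative values only; not taken from the IPL data.
values = np.array([1.0, 2.0, 3.0, 4.0])

# Population standard deviation (ddof=0): the estimator StandardScaler uses.
z_population = (values - values.mean()) / values.std(ddof=0)

# Sample standard deviation (ddof=1): the default of pandas Series.std().
z_sample = (values - values.mean()) / values.std(ddof=1)

# The element-wise ratio is the constant sqrt(n / (n - 1)).
ratio = z_population / z_sample
print(ratio)
```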

6.4.1.3 Split the dataset into train and test

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled,
    Y,
    test_size=0.2,
    random_state = 42)

6.4.1.4 Build the model

from sklearn.linear_model import LinearRegression

linreg = LinearRegression()
linreg.fit(X_train, y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
normalize=False)

linreg.coef_

array([-0.43539611, -0.04632556,  0.50840867, -0.03323988,  0.2220377 ,
       -0.05065703,  0.17282657, -0.49173336,  0.58571405, -0.11654753,
        0.24880095,  0.09546057,  0.16428731,  0.26400753, -0.08253341,
       -0.28643889, -0.26842214, -0.21910913, -0.02622351,  0.24817898,
        0.18760332,  0.10776084,  0.04737488,  0.05191335,  0.01235245,
        0.00547115, -0.03124706,  0.08530192,  0.01790803, -0.05077454,
        0.18745577])

## The dataframe has two columns to store the feature name
## and the corresponding coefficient value
columns_coef_df = pd.DataFrame( { 'columns': ipl_auction_encoded_df.columns,
                                  'coef': linreg.coef_ } )
## Sorting the features by coefficient values in descending order
sorted_coef_vals = columns_coef_df.sort_values( 'coef', ascending=False)

6.4.1.5 Plotting the coefficient values

plt.figure( figsize = ( 8, 6 ))
## Creating a bar plot
sn.barplot(x="coef", y="columns",
data=sorted_coef_vals);
plt.xlabel("Coefficients from Linear Regression")
plt.ylabel("Features")

Text(0,0.5,'Features')

6.4.1.6 Calculate RMSE

from sklearn import metrics

# Takes a model as a parameter
# Prints the RMSE on train and test set
def get_train_test_rmse( model ):
    # Predicting on training dataset
    y_train_pred = model.predict( X_train )
    # Compare the actual y with predicted y in the training dataset
    rmse_train = round(np.sqrt(metrics.mean_squared_error( y_train, y_train_pred )), 3)
    # Predicting on test dataset
    y_test_pred = model.predict( X_test )
    # Compare the actual y with predicted y in the test dataset
    rmse_test = round(np.sqrt(metrics.mean_squared_error( y_test, y_test_pred )), 3)
    print( "train: ", rmse_train, " test:", rmse_test )

get_train_test_rmse( linreg )
train: 0.679 test: 0.749
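
The helper above reports RMSE only. For readers who also want R-squared, the sketch below spells out both formulas with NumPy; r_squared and rmse are illustrative helper names, not part of the notebook. sklearn's metrics.r2_score computes the same quantity as r_squared here.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Predicting the mean for every row gives R-squared of exactly 0.
baseline_r2 = r_squared([1, 2, 3], [2, 2, 2])
baseline_rmse = rmse([1, 2, 3], [2, 2, 2])
print(baseline_r2, baseline_rmse)
```

Because the target was standardized to roughly unit variance, an RMSE near 1 corresponds to a model that does little better than predicting the mean.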

6.4.2 Applying Regularization

6.4.2.1 Ridge Regression

# Importing Ridge Regression

from sklearn.linear_model import Ridge

# Applying alpha = 1 and running the algorithms for maximum of 500 iterations
ridge = Ridge(alpha = 1, max_iter = 500)
ridge.fit( X_train, y_train )

Ridge(alpha=1, copy_X=True, fit_intercept=True, max_iter=500, normalize=False,
      random_state=None, solver='auto', tol=0.001)

get_train_test_rmse( ridge )
train: 0.68 test: 0.724

ridge = Ridge(alpha = 2.0, max_iter = 1000)

ridge.fit( X_train, y_train )
get_train_test_rmse( ridge )
train: 0.682 test: 0.706
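
Raising alpha strengthens the L2 penalty, which pulls the coefficient vector toward zero; this is what narrows the train/test gap in the two runs above. The sketch below shows the shrinkage on synthetic data. X_demo, y_demo and true_coef are made up for illustration and are unrelated to the IPL data.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression problem (illustrative only).
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(60, 5))
true_coef = np.array([1.5, -2.0, 0.0, 0.5, 0.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=60)

# Fit ridge for increasing alpha and record the L2 norm of the coefficients.
coef_norms = []
for alpha in [0.1, 1.0, 10.0, 100.0]:
    ridge_demo = Ridge(alpha=alpha).fit(X_demo, y_demo)
    coef_norms.append(float(np.linalg.norm(ridge_demo.coef_)))

print(coef_norms)
```

The norms decrease as alpha grows, but ridge coefficients generally do not become exactly zero, unlike the lasso coefficients in the next section.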

6.4.2.2 Lasso Regression

# Importing Lasso Regression
from sklearn.linear_model import Lasso

# Applying alpha = 0.01 and running the algorithm for a maximum of 500 iterations
lasso = Lasso(alpha = 0.01, max_iter = 500)
lasso.fit( X_train, y_train )
Lasso(alpha=0.01, copy_X=True, fit_intercept=True, max_iter=500,
      normalize=False, positive=False, precompute=False, random_state=None,
      selection='cyclic', tol=0.0001, warm_start=False)

get_train_test_rmse( lasso )
train: 0.688 test: 0.698

## Storing the feature names and coefficient values in the DataFrame

lasso_coef_df = pd.DataFrame( { 'columns':
ipl_auction_encoded_df.columns,
'coef':
lasso.coef_ } )

## Selecting the features whose coefficients were shrunk to exactly zero

lasso_coef_df[lasso_coef_df.coef == 0]

    coef              columns
1   -0.0               T-WKTS
3   -0.0             ODI-SR-B
13  -0.0               AVE-BL
28   0.0  PLAYING ROLE_Bowler
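
The zeroed coefficients above are the lasso's built-in feature selection at work: the L1 penalty can set coefficients to exactly zero, and a large enough alpha removes every feature. A minimal sketch on synthetic data follows; X_demo, y_demo and the alpha values are chosen for illustration only.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic regression problem (illustrative only).
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(60, 5))
true_coef = np.array([1.5, -2.0, 0.0, 0.5, 0.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=60)

# Count exactly-zero coefficients for a weak, moderate and very strong penalty.
zero_counts = {}
for alpha in [0.001, 0.1, 100.0]:
    lasso_demo = Lasso(alpha=alpha, max_iter=5000).fit(X_demo, y_demo)
    zero_counts[alpha] = int(np.sum(lasso_demo.coef_ == 0))

print(zero_counts)
```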

6.4.2.3 Elastic Net Regression

0.01/1.01
0.009900990099009901

from sklearn.linear_model import ElasticNet

enet = ElasticNet(alpha = 1.01, l1_ratio = 0.0099, max_iter = 500)

enet.fit( X_train, y_train )
get_train_test_rmse( enet )
train: 0.794 test: 0.674
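
The 0.01/1.01 calculation at the top of this section presumably comes from scikit-learn's ElasticNet parameterization: for a penalty a * ||w||_1 + 0.5 * b * ||w||_2^2, the estimator expects alpha = a + b and l1_ratio = a / (a + b). The sketch below makes that conversion explicit; elastic_net_params is a hypothetical helper name, and reading a = 0.01, b = 1.0 as the intended weights is an assumption.

```python
def elastic_net_params(l1_weight, l2_weight):
    """Convert separate L1/L2 penalty weights (a, b) into the
    (alpha, l1_ratio) pair used by sklearn.linear_model.ElasticNet."""
    alpha = l1_weight + l2_weight
    l1_ratio = l1_weight / alpha
    return alpha, l1_ratio

# Assumed weights: a = 0.01 for the L1 term and b = 1.0 for the L2 term.
alpha, l1_ratio = elastic_net_params(0.01, 1.0)
print(alpha, l1_ratio)
```

The notebook passes the rounded value 0.0099 as l1_ratio, which is close enough to the exact ratio for this purpose.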

6.4 More Advanced Algorithms

bank_df = pd.read_csv( 'bank.csv')
bank_df.head(5)

   age          job  marital  education default  balance housing-loan personal-loan  current-campaign
0   30   unemployed  married    primary      no     1787           no            no                 1
1   33     services  married  secondary      no     4789          yes           yes                 1
2   35   management   single   tertiary      no     1350          yes            no                 1
3   30   management  married   tertiary      no     1476          yes           yes                 4
4   59  blue-collar  married  secondary      no        0          yes            no                 1

bank_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4521 entries, 0 to 4520
Data columns (total 11 columns):
age 4521 non-null int64
job 4521 non-null object
marital 4521 non-null object
education 4521 non-null object
default 4521 non-null object
balance 4521 non-null int64
housing-loan 4521 non-null object
personal-loan 4521 non-null object
current-campaign 4521 non-null int64
previous-campaign 4521 non-null int64
subscribed 4521 non-null object
dtypes: int64(4), object(7)
memory usage: 388.6+ KB

6.4.1 Dealing with imbalanced datasets

bank_df.subscribed.value_counts()
no 4000
yes 521
Name: subscribed, dtype: int64
## Importing resample from *sklearn.utils* package.
from sklearn.utils import resample

# Separate the case of yes-subscribes and no-subscribes

bank_subscribed_no = bank_df[bank_df.subscribed == 'no']
bank_subscribed_yes = bank_df[bank_df.subscribed == 'yes']

## Upsample the yes-subscribed cases.

df_minority_upsampled = resample(bank_subscribed_yes,
                                 replace=True, # sample with replacement
                                 n_samples=2000)

# Combine majority class with upsampled minority class

new_bank_df = pd.concat([bank_subscribed_no, df_minority_upsampled])

from sklearn.utils import shuffle

new_bank_df = shuffle(new_bank_df)

# Assigning list of all column names in the DataFrame

X_features = list( new_bank_df.columns )
# Remove the response variable from the list
X_features.remove( 'subscribed' )
X_features
['age',
'job',
'marital',
'education',
'default',
'balance',
'housing-loan',
'personal-loan',
'current-campaign',
'previous-campaign']

## get_dummies() will convert all the columns with data type as objects
encoded_bank_df = pd.get_dummies( new_bank_df[X_features], drop_first = True )
X = encoded_bank_df

# Encoding the subscribed column and assigning to Y

Y = new_bank_df.subscribed.map( lambda x: int( x == 'yes') )

from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split( X,
                                                     Y,
                                                     test_size = 0.3,
                                                     random_state = 42 )
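
One thing to watch in the workflow above: the minority class is upsampled before the train/test split, so copies of the same 'yes' row can land in both sets, which can make test scores look better than they would on unseen data. A common alternative is to split first and upsample only the training portion. The sketch below shows that order on a made-up frame; toy_df and its columns are hypothetical and this is not the notebook's own procedure.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical imbalanced data: 20 'yes' rows and 80 'no' rows.
toy_df = pd.DataFrame({
    "balance": range(100),
    "subscribed": ["yes"] * 20 + ["no"] * 80,
})

# 1. Split first, stratifying so both sets keep the original class ratio.
train_df, test_df = train_test_split(
    toy_df, test_size=0.3, stratify=toy_df["subscribed"], random_state=42)

# 2. Upsample the minority class inside the training set only.
train_no = train_df[train_df.subscribed == "no"]
train_yes = train_df[train_df.subscribed == "yes"]
train_yes_up = resample(train_yes, replace=True,
                        n_samples=len(train_no), random_state=42)

# 3. The test set is left untouched and shares no rows with the training set.
balanced_train_df = pd.concat([train_no, train_yes_up])
train_counts = balanced_train_df.subscribed.value_counts()
print(train_counts)
```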

6.4.2 Logistic Regression model

6.4.2.1 Building the model

from sklearn.linear_model import LogisticRegression

## Initializing the model

logit = LogisticRegression()
## Fitting the model with X and Y values of the dataset
logit.fit( train_X, train_y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

pred_y = logit.predict(test_X)

6.4.2.2 Confusion Matrix

## Importing the metrics

from sklearn import metrics

## Defining the method to draw the confusion matrix from actual and predicted class labels
def draw_cm( actual, predicted ):
    # Invoking confusion_matrix from the metrics package. The matrix is oriented as [1,0], i.e.
    # the class with label 1 is represented in the first row and label 0 in the second row.
    # labels is passed by keyword; recent scikit-learn versions do not accept it positionally.
    cm = metrics.confusion_matrix( actual, predicted, labels = [1,0] )
    ## The confusion matrix is plotted as a heatmap for better visualization
    ## The labels are configured for better interpretation of the plot
    sn.heatmap(cm, annot=True, fmt='.2f',
               xticklabels = ["Subscribed", "Not Subscribed"] ,
               yticklabels = ["Subscribed", "Not Subscribed"] )
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.show()

draw_cm( test_y, pred_y )

6.5.2.3 Classification Report
print( metrics.classification_report( test_y, pred_y ) )

              precision    recall  f1-score   support

           0       0.73      0.92      0.81      1225
           1       0.60      0.27      0.37       575

   micro avg       0.71      0.71      0.71      1800
   macro avg       0.66      0.59      0.59      1800
weighted avg       0.69      0.71      0.67      1800
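
The precision and recall figures in the report follow directly from the confusion matrix counts. The sketch below works this out for class 1 on a small made-up example, using the same [1, 0] label orientation as draw_cm; actual_demo and predicted_demo are illustrative and not taken from the bank data.

```python
from sklearn import metrics

# Made-up labels for illustration.
actual_demo = [1, 1, 1, 0, 0, 0, 0, 1]
predicted_demo = [1, 0, 1, 0, 0, 1, 0, 0]

# Rows are actual classes, columns are predicted classes, both ordered [1, 0].
cm_demo = metrics.confusion_matrix(actual_demo, predicted_demo, labels=[1, 0])
tp, fn = cm_demo[0]   # actual 1: predicted 1, predicted 0
fp, tn = cm_demo[1]   # actual 0: predicted 1, predicted 0

precision_1 = tp / (tp + fp)   # of everything predicted 1, how much is really 1
recall_1 = tp / (tp + fn)      # of everything really 1, how much was found
print(cm_demo)
print(precision_1, recall_1)
```

Read this way, the 0.27 recall for class 1 in the report means the logistic regression model finds only about a quarter of the actual subscribers.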

6.5.2.4 ROC AUC Score

## Predicting the probability values for test cases

predict_proba_df = pd.DataFrame( logit.predict_proba( test_X ) )
predict_proba_df.head()

          0         1
0  0.704479  0.295521
1  0.853664  0.146336
2  0.666963  0.333037
3  0.588329  0.411671
4  0.707982  0.292018

## Initializing the DataFrame with actual class labels

test_results_df = pd.DataFrame( { 'actual': test_y } )
test_results_df = test_results_df.reset_index()
## Assigning the probability values for class label 1
test_results_df['chd_1'] = predict_proba_df.iloc[:,1:2]

test_results_df.head(5)

   index  actual     chd_1
0   1321       0  0.295521
1   3677       0  0.146336
2   1680       1  0.333037
3    821       0  0.411671
4    921       0  0.292018

# Passing actual class labels and the predicted probability values to compute the ROC AUC score.
auc_score = metrics.roc_auc_score( test_results_df.actual, test_results_df.chd_1 )
round( float( auc_score ), 2 )

0.69

## The method takes the following three parameters
## model: the classification model
## test_X: X features of the test set
## test_y: actual labels of the test set
## Returns
## - ROC AUC score
## - FPRs, TPRs and the corresponding threshold values
def draw_roc_curve( model, test_X, test_y ):
    ## Creating and initializing a results DataFrame with actual labels
    test_results_df = pd.DataFrame( { 'actual': test_y } )
    test_results_df = test_results_df.reset_index()

    # Predict the probabilities on the test set
    predict_proba_df = pd.DataFrame( model.predict_proba( test_X ) )

    ## Selecting the probabilities that the test example belongs to class 1
    test_results_df['chd_1'] = predict_proba_df.iloc[:,1:2]

    ## Invoke roc_curve() to return the fpr, tpr and threshold values.
    ## thresholds holds the probability cut-offs at which fpr and tpr are evaluated
    fpr, tpr, thresholds = metrics.roc_curve( test_results_df.actual,
                                              test_results_df.chd_1,
                                              drop_intermediate = False )

    ## Getting the ROC AUC score by invoking the metrics.roc_auc_score method
    auc_score = metrics.roc_auc_score( test_results_df.actual, test_results_df.chd_1 )

    ## Setting the size of the plot
    plt.figure(figsize=(8, 6))
    ## Plotting the actual fpr and tpr values
    plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score )
    ## Plotting the diagonal line from (0,0) to (1,1)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    ## Setting labels and titles
    plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right")
    plt.show()

    return auc_score, fpr, tpr, thresholds

## Invoking draw_roc_curve with the logistic regression model
_, _, _, _ = draw_roc_curve( logit, test_X, test_y )
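
A useful way to read the AUC reported by draw_roc_curve: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counting as half. The sketch below computes it by brute-force pair counting in plain Python; pairwise_auc is an illustrative helper, not part of the notebook, and it should agree with metrics.roc_auc_score on the same inputs.

```python
def pairwise_auc(actual, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.
    Ties between a positive and a negative score count as half a win."""
    positives = [s for a, s in zip(actual, scores) if a == 1]
    negatives = [s for a, s in zip(actual, scores) if a == 0]
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up labels and scores: 3 of the 4 positive/negative pairs are ordered correctly.
demo_auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(demo_auc)
```

An AUC of 0.5 therefore corresponds to random ranking, which is the dashed diagonal drawn in the plot.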

6.5.3 KNN Algorithm

## Importing the KNN classifier algorithm

from sklearn.neighbors import KNeighborsClassifier

## Initializing the classifier

knn_clf = KNeighborsClassifier()
## Fitting the model with the training set
knn_clf.fit( train_X, train_y )
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform')

6.5.3.1 KNN Accuracy

## Invoking draw_roc_curve with the KNN model
_, _, _, _ = draw_roc_curve( knn_clf, test_X, test_y )

## Predicting on test set

pred_y = knn_clf.predict(test_X)
## Drawing the confusion matrix for KNN model
draw_cm( test_y, pred_y )
print( metrics.classification_report( test_y, pred_y ) )

              precision    recall  f1-score   support

           0       0.85      0.77      0.81      1225
           1       0.59      0.72      0.65       575

   micro avg       0.75      0.75      0.75      1800
   macro avg       0.72      0.74      0.73      1800
weighted avg       0.77      0.75      0.76      1800
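
KNN relies on distances, so a feature with a large numeric range (balance in the bank data, for instance) can dominate the neighbour search when features are left unscaled, as they are above. A hedged sketch of one common remedy, wrapping the classifier in a pipeline with StandardScaler, is shown below on synthetic data. The data, variable names and the size of the effect are illustrative; whether scaling helps on the bank data would need to be checked.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: one informative small-scale feature, one large-scale noise feature.
rng = np.random.RandomState(0)
n_rows = 400
informative = rng.normal(size=n_rows)
large_noise = rng.normal(scale=1000.0, size=n_rows)
X_demo = np.column_stack([informative, large_noise])
y_demo = (informative > 0).astype(int)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42)

# KNN on raw features: distances are dominated by the noise column.
raw_knn = KNeighborsClassifier().fit(Xd_train, yd_train)
raw_acc = raw_knn.score(Xd_test, yd_test)

# KNN inside a pipeline: the scaler is fitted on the training data only.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
scaled_knn.fit(Xd_train, yd_train)
scaled_acc = scaled_knn.score(Xd_test, yd_test)

print("raw:", raw_acc, "scaled:", scaled_acc)
```

A pipeline like this can also be passed directly to GridSearchCV, so that scaling is refitted within each cross-validation fold.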

6.5.3.2 Grid Search for optimal parameters

## Importing GridSearchCV
from sklearn.model_selection import GridSearchCV

## Creating a dictionary with hyperparameters and possible values for searching

tuned_parameters = [{'n_neighbors': range(5,10),
'metric': ['canberra', 'euclidean', 'minkowski']}]

## Configuring grid search

clf = GridSearchCV(KNeighborsClassifier(),
tuned_parameters,
cv=10,
scoring='roc_auc')
## fit the search with training set
clf.fit(train_X, train_y )

GridSearchCV(cv=10, error_score='raise-deprecating',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid=[{'n_neighbors': range(5, 10), 'metric': ['canberra', 'euclidean', 'minkowski']}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='roc_auc', verbose=0)

clf.best_score_
0.8368537419503068

clf.best_params_
{'metric': 'canberra', 'n_neighbors': 5}
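
best_score_ and best_params_ show only the winning combination. To see how close the other candidates came, the cv_results_ attribute of a fitted GridSearchCV can be loaded into a DataFrame. The sketch below does this on synthetic data with a smaller grid than the one above; the data, grid values and variable names are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (illustrative only).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

demo_grid = [{"n_neighbors": [3, 5], "metric": ["euclidean", "manhattan"]}]
demo_search = GridSearchCV(KNeighborsClassifier(), demo_grid, cv=5, scoring="roc_auc")
demo_search.fit(X_demo, y_demo)

# One row per parameter combination, best-ranked first.
results_df = pd.DataFrame(demo_search.cv_results_)[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results_df)
```

Comparing mean_test_score against std_test_score helps judge whether the top-ranked setting is meaningfully better than the runners-up or within fold-to-fold noise.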

6.5.4 Ensemble Methods

6.5.5 Random Forest

6.5.5.1 Building Random Forest Model

## Importing Random Forest Classifier from the sklearn.ensemble
from sklearn.ensemble import RandomForestClassifier

## Initializing the Random Forest Classifier with max_depth and n_estimators

radm_clf = RandomForestClassifier( max_depth=10, n_estimators=10)
radm_clf.fit( train_X, train_y )

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=10, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

_, _, _, _ = draw_roc_curve( radm_clf, test_X, test_y );

6.5.5.2 Grid Search for Optimal Parameters

## Configuring the parameters and the values to be searched
tuned_parameters = [{'max_depth': [10, 15],
'n_estimators': [10,20],
'max_features': ['sqrt', 'auto']}]

## Initializing the RF classifier

radm_clf = RandomForestClassifier()

## Configuring search with the tunable parameters

clf = GridSearchCV(radm_clf,
tuned_parameters,
cv=5,
scoring='roc_auc')

## Fitting the training set

clf.fit(train_X, train_y )

GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid=[{'n_estimators': [10, 20], 'max_depth': [10, 15],
                    'max_features': ['sqrt', 'auto']}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='roc_auc', verbose=0)

clf.best_score_
0.9399595384858543

clf.best_params_
{'max_depth': 15, 'max_features': 'auto', 'n_estimators': 20}

6.5.5.3 Building the final model with optimal parameter values

## Initializing the Random Forest Model with the optimal values
radm_clf = RandomForestClassifier( max_depth=15, n_estimators=20, max_features = 'auto')
## Fitting the model with the training set
radm_clf.fit( train_X, train_y )

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=15, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=20, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

6.5.5.4 ROC AUC Score

## Evaluating the final model built above
_, _, _, _ = draw_roc_curve( radm_clf, test_X, test_y )

6.5.5.5 Drawing the confusion matrix

pred_y = radm_clf.predict( test_X )
draw_cm( test_y, pred_y )

print( metrics.classification_report( test_y, pred_y ) )

              precision    recall  f1-score   support

           0       0.90      0.94      0.92      1225
           1       0.86      0.78      0.82       575

   micro avg       0.89      0.89      0.89      1800
   macro avg       0.88      0.86      0.87      1800
weighted avg       0.89      0.89      0.89      1800

6.5.5.6 Finding important features

import numpy as np

# Create a dataframe to store the features and their corresponding importances
feature_rank = pd.DataFrame( { 'feature': train_X.columns,
                               'importance': radm_clf.feature_importances_ } )

## Sorting the features based on their importances with the most important feature at the top.
feature_rank = feature_rank.sort_values('importance', ascending = False)

plt.figure(figsize=(8, 6))
# plot the values
sn.barplot( y = 'feature', x = 'importance', data = feature_rank );
feature_rank['cumsum'] = feature_rank.importance.cumsum() * 100
feature_rank.head(10)

                feature  importance     cumsum
1               balance    0.269603  26.960282
0                   age    0.203664  47.326707
3     previous-campaign    0.117525  59.079219
2      current-campaign    0.090085  68.087703
21     housing-loan_yes    0.039898  72.077486
15      marital_married    0.034329  75.510337
22    personal-loan_yes    0.027029  78.213244
17  education_secondary    0.023934  80.606690
4       job_blue-collar    0.023081  82.914811
16       marital_single    0.022495  85.164357
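
The cumsum column works because a fitted random forest's feature_importances_ are normalized to sum to 1, so the running total (times 100) reads as the percentage of total importance covered by the top features. The sketch below reproduces the idea on synthetic data and counts how many features are needed to reach 80%; the data, feature names and the 80% cut-off are illustrative choices.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with six hypothetical features f0..f5.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
feature_names = ["f%d" % i for i in range(6)]

rf_demo = RandomForestClassifier(n_estimators=20, max_depth=5, random_state=0)
rf_demo.fit(X_demo, y_demo)

# Rank features and accumulate their importances as percentages.
rank_df = pd.DataFrame({"feature": feature_names,
                        "importance": rf_demo.feature_importances_})
rank_df = rank_df.sort_values("importance", ascending=False)
rank_df["cumsum"] = rank_df.importance.cumsum() * 100

# Number of top-ranked features needed to cover at least 80% of total importance.
n_for_80 = int((rank_df["cumsum"] < 80).sum()) + 1
print(rank_df)
print(n_for_80)
```

In the table above, the same reading shows that the top eight features account for roughly 80% of the total importance.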

6.5.6 Boosting

6.5.6.1 Adaboost

## Importing Adaboost classifier

from sklearn.ensemble import AdaBoostClassifier

## Initializing logistic regression to use as base classifier

logreg_clf = LogisticRegression()

## Initializing the AdaBoost classifier with 50 classifiers

ada_clf = AdaBoostClassifier(logreg_clf, n_estimators=50)

## Fitting adaboost model to training set

ada_clf.fit(train_X, train_y )
AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False),
          learning_rate=1.0, n_estimators=50, random_state=None)
_, _, _, _ = draw_roc_curve( ada_clf, test_X, test_y )

6.5.6.2 Gradient Boosting

## Importing Gradient Boosting classifier

from sklearn.ensemble import GradientBoostingClassifier

## Initializing Gradient Boosting with 500 estimators and max depth as 10.
gboost_clf = GradientBoostingClassifier( n_estimators=500, max_depth=10)

## Fitting gradient boosting model to training set

gboost_clf.fit(train_X, train_y )
GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=10,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=500,
              n_iter_no_change=None, presort='auto', random_state=None,
              subsample=1.0, tol=0.0001, validation_fraction=0.1,
              verbose=0, warm_start=False)
_, _, _, _ = draw_roc_curve( gboost_clf, test_X, test_y )

from sklearn.model_selection import cross_val_score

gboost_clf = GradientBoostingClassifier( n_estimators=500, max_depth=10)

cv_scores = cross_val_score( gboost_clf, train_X, train_y, cv = 10, scoring = 'roc_auc' )

## Note: with scoring = 'roc_auc' the values printed below are ROC AUC scores,
## so the printed "Mean Accuracy" is in fact the mean ROC AUC across the folds.
print( cv_scores )
print( "Mean Accuracy: ", np.mean(cv_scores), " with standard deviation of: ",
       np.std(cv_scores))
[0.98241686 0.98105851 0.98084469 0.9585199  0.95482216 0.96667006
 0.95342452 0.97368689 0.95937357 0.98174607]
Mean Accuracy: 0.969256322542174 with standard deviation of: 0.011406249012935668
gboost_clf.fit(train_X, train_y )
pred_y = gboost_clf.predict( test_X )
draw_cm( test_y, pred_y )

print( metrics.classification_report( test_y, pred_y ) )

              precision    recall  f1-score   support

           0       0.96      0.95      0.96      1225
           1       0.90      0.92      0.91       575

   micro avg       0.94      0.94      0.94      1800
   macro avg       0.93      0.94      0.94      1800
weighted avg       0.94      0.94      0.94      1800

import numpy as np

# Create a dataframe to store the features and their corresponding importances
feature_rank = pd.DataFrame( { 'feature': train_X.columns,
                               'importance': gboost_clf.feature_importances_ } )

## Sorting the features based on their importances with the most important feature at the top.
feature_rank = feature_rank.sort_values('importance', ascending = False)

plt.figure(figsize=(8, 6))
# plot the values
sn.barplot( y = 'feature', x = 'importance', data = feature_rank );
