ML Unit 3

The document discusses bootstrapping as a resampling method in statistics, explaining its application in creating smaller datasets for analysis. It introduces bagging and pasting as ensemble methods, highlighting the concept of out-of-bag scoring and the use of random forests for classification and regression tasks. It also provides a data preparation example using Python, including preprocessing steps and model training with grid search for hyperparameter tuning, and closes with notes on gradient boosting residuals, AdaBoost sample weighting, and stacking.

In statistics, bootstrapping refers to a resampling method that consists of repeatedly drawing samples with replacement from the data to form other, smaller datasets, called bootstrap samples. It is as if the bootstrapping method were running a bunch of simulations on our original dataset, so in some cases we can generalise statistics such as the mean and the standard deviation.

For example, suppose we have a set of observations: (2, 4, 32, 8, 16). If we want each bootstrap sample to contain n observations, the following are valid samples:

n=3: (32, 4, 4), (8, 16, 2), (2, 2, 2), ...
n=4: (2, 32, 4, 16), (2, 4, 2, 8), (8, 32, 4, 2), ...

Because sampling is done with replacement, the same observation can appear more than once in a bootstrap sample.
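A minimal sketch of bootstrap resampling in Python, assuming NumPy is available; the observations are the ones from the example above, and the sample size and number of repetitions are illustrative.

import numpy as np

data = np.array([2, 4, 32, 8, 16])
rng = np.random.default_rng(42)

# Draw 1000 bootstrap samples of size 3 (with replacement) and keep the
# mean of each one to approximate the sampling distribution of the mean.
boot_means = [rng.choice(data, size=3, replace=True).mean() for _ in range(1000)]

print("Estimated mean:", np.mean(boot_means))
print("Estimated std of the mean:", np.std(boot_means))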

BAGGING & PASTING
Bagging means bootstrap + aggregating, and it is an ensemble method in which we first bootstrap our data and, for each bootstrap sample, train one model. After that, we aggregate their predictions with equal weights. When sampling is done without replacement, the method is called pasting.
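A minimal scikit-learn sketch of bagging versus pasting, assuming a generic feature matrix X and labels y; the only difference between the two is the bootstrap flag.

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Bagging: each of the 100 trees is trained on a sample drawn WITH replacement.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                            bootstrap=True, random_state=42)

# Pasting: same idea, but samples are drawn WITHOUT replacement.
pasting = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                            bootstrap=False, random_state=42)

# bagging.fit(X, y); pasting.fit(X, y)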
OUT-OF-BAG SCORING
If we are using bagging, there is a chance that a given sample is never selected, while others may be selected multiple times. The probability of not selecting a specific sample is (1 - 1/n)^n, which approaches roughly 0.37 for large n. The samples that were never used to train a given model can still be used to evaluate it; this is called out-of-bag (OOB) scoring.
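A minimal sketch of OOB scoring with scikit-learn's BaggingClassifier, assuming a feature matrix X and labels y.

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                        bootstrap=True, oob_score=True, random_state=42)
bag.fit(X, y)

# Each tree is evaluated on the samples it never saw during training.
print("OOB score:", bag.oob_score_)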

RANDOM FOREST
A random forest is an ensemble of decision trees that can be used for classification or regression. For classification, the class that receives the most votes among the trees becomes the output of the model. This helps make the model more accurate and stable, preventing overfitting.

Another very useful property of random forests is the ability to measure the relative importance of each feature by calculating how much each one reduces the impurity of the model. This is called feature importance.
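A minimal sketch of training a random forest and reading its feature importances, assuming a DataFrame X of features and labels y; the hyperparameters are illustrative.

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=200, random_state=42)
forest.fit(X, y)

# feature_importances_ sums to 1; larger values mean a larger mean impurity reduction.
for name, imp in zip(X.columns, forest.feature_importances_):
    print('{}: {:.3f}'.format(name, imp))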

DATA PREPARATION

import numpy as np
import pandas as pd

# Load the income dataset and encode the target as integer codes.
df = pd.read_csv('data/income.csv')
col = pd.Categorical(df.high_income)
df["high_income"] = col.codes
We define a transformer to preprocess our data:

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import MinMaxScaler

class PreprocessTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, cat_features, num_features):
        self.cat_features = cat_features
        self.num_features = num_features

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        df = X.copy()
        # Clean up the messiest categorical values.
        df.loc[df['workclass'] == '?', 'workclass'] = 'Unknown'
        df.loc[df['native_country'] != 'united-states', 'native_country'] = 'non_usa'
        # Encode every categorical feature as integer codes.
        for name in self.cat_features:
            col = pd.Categorical(df[name])
            df[name] = col.codes
        # Scale the numeric features to the [0, 1] range.
        scaler = MinMaxScaler()
        df[self.num_features] = scaler.fit_transform(df[self.num_features])
        return df

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df.drop('high_income', axis=1),
    df['high_income'],
    test_size=0.2,
    random_state=42,
    shuffle=True,
    stratify=df['high_income']
)

First we create a pipeline to preprocess the data with our custom transformer.
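The original text does not show the pipeline definition, so the following is only a sketch of what pipe might look like, assuming the custom transformer above, chi-square feature selection (to match the fs__score_func and fs__k parameters in the search space below), and a placeholder classifier; the cat_features and num_features lists are illustrative.

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.tree import DecisionTreeClassifier

cat_features = ['workclass', 'education', 'occupation', 'native_country']  # illustrative
num_features = ['age', 'hours_per_week']                                   # illustrative

pipe = Pipeline([
    ('prep', PreprocessTransformer(cat_features, num_features)),
    ('fs', SelectKBest(score_func=chi2, k=10)),   # matches the 'fs__' parameters below
    ('clf', DecisionTreeClassifier()),            # replaced by the grid search
])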

from sklearn.model_selection import KFold, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import chi2
from sklearn.metrics import accuracy_score, make_scorer

search_space = [
    {
        'clf': [DecisionTreeClassifier()],
        'clf__max_leaf_nodes': [128],
        'fs__score_func': [chi2],
        'fs__k': [10],
    },
    {
        # oob_score=True so the out-of-bag score can be read after fitting.
        'clf': [RandomForestClassifier(oob_score=True)],
        'clf__n_estimators': [200],
        'clf__max_leaf_nodes': [128],
        'fs__score_func': [chi2],
        'fs__k': [10],
    }
]

scoring = {'AUC': 'roc_auc', 'Accuracy': make_scorer(accuracy_score)}
kfold = KFold(n_splits=10, shuffle=True, random_state=42)

grid = GridSearchCV(
    pipe,
    param_grid=search_space,
    cv=kfold,
    scoring=scoring,
    refit='AUC',
    verbose=1,
    n_jobs=-1
)
model = grid.fit(X_train, y_train)

best_estimator = grid.best_estimator_.steps[-1][1]
columns = X_test.columns.tolist()

print('OOB Score: {}'.format(best_estimator.oob_score_))
print('Feature Importances:')
for i, imp in enumerate(best_estimator.feature_importances_):
    print('{}: {:.3f}'.format(columns[i], imp))
GRADIENT BOOSTING
Each new tree is fitted to the residuals of the current model:

Residual = Actual Value - Predicted Value

Starting from the average of the target (for example, the average price), the prediction after one tree is:

Prediction = Average Price + Learning Rate * Residual Predicted by Decision Tree

and after N trees:

Prediction = Average Price + LR * Residual Predicted by DT1 + LR * Residual Predicted by DT2 + ... + LR * Residual Predicted by DT N
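A minimal sketch of the additive update above, assuming a numeric feature matrix X and target y (for example, house prices); it hand-rolls the residual fitting purely to mirror the formula rather than using a library boosting class.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit_predict(X, y, n_trees=3, lr=0.1):
    # Start from the average of the target (the "Average Price" above).
    prediction = np.full(len(y), y.mean())
    for _ in range(n_trees):
        residual = y - prediction                        # Actual - Predicted
        tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
        prediction = prediction + lr * tree.predict(X)   # add LR * predicted residual
    return prediction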

ADABOOST
Each weak learner gets an "amount of say" based on its total error:

Amount of Say = (1/2) * log((1 - Total Error) / Total Error)

For example, with Total Error = 1/4 (the numbers here use log base 10):

Amount of Say = (1/2) * log((1 - 1/4) / (1/4)) = 0.239

The sample weights are then updated. For misclassified samples the weight is increased:

New Sample Weight = Sample Weight * e^(Amount of Say) = (1/4) * e^(0.239) ≈ 0.317

and for correctly classified samples it is decreased:

New Sample Weight = Sample Weight * e^(-Amount of Say) = (1/4) * e^(-0.239) = 0.197
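A small sketch reproducing the arithmetic above in Python; note it uses log base 10 to match the 0.239 value, whereas the textbook AdaBoost formula uses the natural log.

import math

total_error = 1 / 4
sample_weight = 1 / 4

amount_of_say = 0.5 * math.log10((1 - total_error) / total_error)   # 0.239

# Misclassified samples get a larger weight, correct ones a smaller weight.
weight_wrong = sample_weight * math.exp(amount_of_say)    # ~0.317
weight_right = sample_weight * math.exp(-amount_of_say)   # ~0.197

print(round(amount_of_say, 3), round(weight_wrong, 3), round(weight_right, 3))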

STACKING
Several base models are trained on the data, and a meta-model learns how to combine their predictions:

Model          Algorithm
Base Model 1   Decision Tree
Base Model 2   Neural Network
Base Model 3   Support Vector Machine
Meta-Model     Logistic Regression
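A minimal scikit-learn sketch of the stacking setup in the table, assuming training data X_train and y_train; the individual model settings are illustrative.

from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

stack = StackingClassifier(
    estimators=[
        ('dt', DecisionTreeClassifier()),       # Base Model 1
        ('nn', MLPClassifier(max_iter=500)),    # Base Model 2
        ('svm', SVC(probability=True)),         # Base Model 3
    ],
    final_estimator=LogisticRegression(),       # Meta-Model
)
# stack.fit(X_train, y_train)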
