caret Package Cheat Sheet

The caret package in R can be used for preprocessing data, specifying models, tuning hyperparameters, and evaluating model performance. Preprocessing methods such as centering, scaling, and imputation can be applied. Models are specified with the train function using either a formula interface or x/y matrices. Resampling methods such as cross-validation and bootstrapping are configured with trainControl. Hyperparameters can be tuned via grid search over a specified grid or random search over a parameter range. Performance is summarized using functions like defaultSummary or custom functions.

Specifying the Model

Possible syntaxes for specifying the variables in the model:

train(y ~ x1 + x2, data = dat, ...)
train(x = predictor_df, y = outcome_vector, ...)
train(recipe_object, data = dat, ...)

• rfe, sbf, gafs, and safs only have the x/y interface.
• The train formula method will always create dummy variables.
• The x/y interface to train will not create dummy variables (but the underlying model function might).

Remember to:

• Have column names in your data.
• Use factors for a classification outcome (not 0/1 or integers).
• Have valid R names for class levels (not "0"/"1").
• Set the random number seed prior to calling train repeatedly to get the same resamples across calls (see the sketch after this list).
• Use the train option na.action = na.pass if you will be imputing missing data. Also, use this option when predicting new data containing missing values.
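A minimal sketch of the outcome and seed points above (the data frame dat and its 0/1 outcome column y are hypothetical):

# hypothetical data: `dat` with a numeric 0/1 outcome column `y`
dat$y <- factor(dat$y, levels = c(0, 1),
                labels = c("Class1", "Class2"))  # valid R level names

set.seed(3456)  # set before each call so resamples match across calls
fit <- train(y ~ ., data = dat, method = "rf")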
To pass options to the underlying model function, you can pass them to train via the ellipses:

train(y ~ ., data = dat, method = "rf",
      # options to `randomForest`:
      importance = TRUE)

Parallel Processing

The foreach package is used to run models in parallel. The train code does not change, but a "do" package must be called first.

# on MacOS or Linux
library(doMC)
registerDoMC(cores = 4)

# on Windows
library(doParallel)
cl <- makeCluster(2)
registerDoParallel(cl)

The function parallel::detectCores can help too.
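For example, a sketch (assuming macOS or Linux) that uses parallel::detectCores to size the worker pool:

library(doMC)
workers <- parallel::detectCores() - 1  # leave one core free
registerDoMC(cores = workers)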
Preprocessing

Transformations, filters, and other operations can be applied to the predictors with the preProc option.

train(, preProc = c("method1", "method2"), ...)

Methods include:

• "center", "scale", and "range" to normalize predictors.
• "BoxCox", "YeoJohnson", or "expoTrans" to transform predictors.
• "knnImpute", "bagImpute", or "medianImpute" to impute.
• "corr", "nzv", "zv", and "conditionalX" to filter.
• "pca", "ica", or "spatialSign" to transform groups.

train determines the order of operations; the order that the methods are declared does not matter.

The recipes package has a more extensive list of preprocessing operations.
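For instance, a sketch that imputes and then normalizes the predictors (the data and the glmnet model choice are illustrative):

# impute missing values with k-nearest neighbors, then center and scale;
# train() fixes the order of operations regardless of how these are listed
fit <- train(y ~ ., data = dat, method = "glmnet",
             preProc = c("knnImpute", "center", "scale"),
             na.action = na.pass)  # pass NAs through to the imputation step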
Adding Options

Many train options can be specified using the trainControl function:

train(y ~ ., data = dat, method = "cubist",
      trControl = trainControl(<options>))

Resampling Options

trainControl is used to choose a resampling method:

trainControl(method = <method>, <options>)

Methods and options are:

• "cv" for K-fold cross-validation (number sets the # of folds).
• "repeatedcv" for repeated cross-validation (repeats sets the # of repeats; see the sketch after this list).
• "boot" for the bootstrap (number sets the iterations).
• "LGOCV" for leave-group-out (number and p are options).
• "LOO" for leave-one-out cross-validation.
• "oob" for out-of-bag resampling (only for some models).
• "timeslice" for time-series data (options are initialWindow, horizon, fixedWindow, and skip).
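For instance, a sketch of five repeats of 10-fold cross-validation (the counts are illustrative):

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
fit <- train(y ~ ., data = dat, method = "cubist", trControl = ctrl)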

Performance Metrics

To choose how to summarize a model, the trainControl function is used again.

trainControl(summaryFunction = <R function>,
             classProbs = <logical>)

Custom R functions can be used, but caret includes several: defaultSummary (for accuracy, RMSE, etc.), twoClassSummary (for ROC curves), and prSummary (for information retrieval). For the last two functions, the option classProbs must be set to TRUE.
Grid Search

To let train determine the values of the tuning parameter(s), the tuneLength option controls how many values per tuning parameter to evaluate.
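A minimal sketch (the CART model via method = "rpart" is an arbitrary choice):

# evaluate 10 values of rpart's complexity parameter `cp`
fit <- train(y ~ ., data = dat, method = "rpart", tuneLength = 10)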
Alternatively, specific values of the tuning parameters can be declared using the tuneGrid argument:

grid <- expand.grid(alpha = c(0.1, 0.5, 0.9),
                    lambda = c(0.001, 0.01))

train(x = x, y = y, method = "glmnet",
      preProc = c("center", "scale"),
      tuneGrid = grid)

Random Search

For tuning, train can also generate random tuning parameter combinations over a wide range. tuneLength controls the total number of combinations to evaluate. To use random search:

trainControl(search = "random")
Subsampling

With a large class imbalance, train can subsample the data to balance the classes prior to model fitting.

trainControl(sampling = "down")

Other values are "up", "smote", or "rose". The latter two may require additional package installs.
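For example, a sketch pairing SMOTE subsampling with cross-validation (imbalanced_df and its Class column are hypothetical, and "smote" needs its backing package installed):

ctrl <- trainControl(method = "cv", sampling = "smote")
fit <- train(Class ~ ., data = imbalanced_df,  # hypothetical imbalanced data
             method = "rf", trControl = ctrl)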

CC BY Max Kuhn • [email protected] • https://github.com/topepo/ • Learn more at https://topepo.github.io/caret/ • Updated: 9/17
