Fast Correlation Based Filter for Feature Selection
Source: R/step_select_fcbf.R
`step_select_fcbf` creates a *specification* of a recipe step that selects a subset of predictors using the FCBF algorithm. The number of features retained depends on the `threshold` parameter: a lower threshold selects more features.
Usage
step_select_fcbf(
  recipe,
  ...,
  threshold = 0.025,
  outcome = NA,
  cutpoint = 0.5,
  features_retained = NA,
  removals = NULL,
  role = NA,
  trained = FALSE,
  skip = FALSE,
  id = rand_id("select_fcbf")
)
Arguments
- recipe
A recipe object. The step will be added to the sequence of operations for this recipe.
- ...
One or more selector functions to choose which predictors are affected by the step. See [selections()] for more details. For the `tidy` method, these are not currently used.
- threshold
A numeric value between 0 and 1 representing the symmetrical uncertainty threshold used by the FCBF algorithm. Lower thresholds allow more features to be selected.
- outcome
A character string specifying the name of the response variable. If not specified by the user, it is inferred from the recipe where possible.
- cutpoint
A numeric value between 0 and 1 representing the quantile at which to split numeric features into binary nominal features. For example, 0.5 splits at the median. See Details for more information on discretization.
- features_retained
A tibble containing the features that were retained by the FCBF algorithm. This parameter is only produced after the recipe has been trained and should not be specified by the user.
- removals
A tibble containing the features that were removed by the FCBF algorithm. This parameter is only produced after the recipe has been trained and should not be specified by the user.
- role
Not used for this step since new variables are not created.
- trained
A logical to indicate if the quantities for preprocessing have been estimated.
- skip
A logical. Should the step be skipped when the recipe is baked by bake.recipe()? While all operations are baked when prep.recipe() is run, some operations may not be able to be conducted on new data (e.g. processing the outcome variable(s)). Care should be taken when using skip = TRUE as it may affect the computations for subsequent operations.
- id
A character string that is unique to this step to identify it.
Value
An updated version of `recipe` with the new step added to the sequence of existing steps (if any). For the `tidy` method, a tibble with a `terms` column for which predictors were removed.
Details
This function implements the fast correlation-based filter (FCBF) algorithm as described in Yu & Liu (2003). FCBF selects features that are highly correlated with the outcome and weakly correlated with other features.
Symmetrical uncertainty (SU) is used to measure the degree of correlation between predictors and the outcome. A threshold value for SU must be specified, and smaller threshold values result in more features being selected by the algorithm. Appropriate thresholds are data-dependent, so different threshold values may need to be explored. It is not possible to specify an exact number of features to retain.
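To make the selection rule concrete, the following base R sketch computes symmetrical uncertainty as SU(X, Y) = 2 * IG(X | Y) / (H(X) + H(Y)) and applies the relevance and redundancy filtering described in Yu & Liu (2003). It is an illustration only, not the code used by this step or by the 'FCBF' package: the helper names `entropy`, `symmetrical_uncertainty` and `fcbf_sketch` are hypothetical, inputs are assumed to be categorical with no missing values, and the use of `>=` in the comparisons is an assumption.

```r
# Shannon entropy (in bits) of a categorical vector.
entropy <- function(x) {
  counts <- table(x)
  p <- counts[counts > 0] / sum(counts)
  -sum(p * log2(p))
}

# Symmetrical uncertainty: 2 * information gain / (H(x) + H(y)), in [0, 1].
symmetrical_uncertainty <- function(x, y) {
  hx <- entropy(x)
  hy <- entropy(y)
  if (hx + hy == 0) return(0)                       # both vectors are constant
  hxy <- entropy(interaction(x, y, drop = TRUE))    # joint entropy H(x, y)
  2 * (hx + hy - hxy) / (hx + hy)
}

# Hypothetical sketch of FCBF: keep features relevant to the outcome, then
# drop any feature that is at least as correlated with an already selected
# feature as it is with the outcome.
fcbf_sketch <- function(features, outcome, threshold = 0.025) {
  su_y <- vapply(features, symmetrical_uncertainty, numeric(1), y = outcome)
  candidates <- names(sort(su_y[su_y >= threshold], decreasing = TRUE))
  selected <- character(0)
  while (length(candidates) > 0) {
    best <- candidates[1]
    selected <- c(selected, best)
    rest <- candidates[-1]
    redundant <- vapply(
      rest,
      function(f) {
        symmetrical_uncertainty(features[[f]], features[[best]]) >= su_y[[f]]
      },
      logical(1)
    )
    candidates <- rest[!redundant]
  }
  selected
}

features <- data.frame(
  a = c("x", "x", "y", "y"),
  b = c("u", "u", "v", "v"),  # carries the same information as `a`
  c = c("p", "q", "p", "q")   # unrelated to the outcome
)
outcome <- c("no", "no", "yes", "yes")

fcbf_sketch(features, outcome)
```

In this toy data, `a` and `b` are equally informative about the outcome but redundant with each other, so only one of them is kept, and `c` falls below the threshold.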
The algorithm requires categorical features, so continuous features are discretized using a binary split (at the median by default; see `cutpoint`). Discretization is only used within the feature selection algorithm; selected features are retained in their original continuous form for further processing.
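The binary split can be pictured with a short base R sketch. The helper `discretize_binary` is hypothetical and is not the function used internally by this step; in particular, assigning values equal to the split point to the lower group is an assumption.

```r
# Split a numeric vector into two groups at the given quantile.
# cutpoint = 0.5 corresponds to a median split.
discretize_binary <- function(x, cutpoint = 0.5) {
  split_value <- stats::quantile(x, probs = cutpoint, na.rm = TRUE)
  factor(ifelse(x > split_value, "high", "low"), levels = c("low", "high"))
}

x <- c(1, 2, 3, 4, 5, 6)
binned <- discretize_binary(x)  # used only for feature selection
binned
```

The original numeric vector `x` is left unchanged, which mirrors how selected features keep their continuous form after the step.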
The FCBF algorithm is implemented by the Bioconductor package 'FCBF', which can be installed with `BiocManager::install("FCBF")`.