
`step_select_fcbf` creates a *specification* of a recipe step that selects a subset of predictors using the FCBF algorithm. The number of features retained depends on the `threshold` parameter: a lower threshold selects more features.

Usage

step_select_fcbf(
  recipe,
  ...,
  threshold = 0.025,
  outcome = NA,
  cutpoint = 0.5,
  features_retained = NA,
  removals = NULL,
  role = NA,
  trained = FALSE,
  skip = FALSE,
  id = rand_id("select_fcbf")
)

Arguments

recipe

A recipe object. The step will be added to the sequence of operations for this recipe.

...

One or more selector functions to choose which predictors are affected by the step. See selections() for more details. For the `tidy` method, these are not currently used.

threshold

A numeric value between 0 and 1 representing the symmetrical uncertainty threshold used by the FCBF algorithm. Lower thresholds allow more features to be selected.

outcome

A character string specifying the name of the response variable. Automatically inferred from the recipe (if possible) when not specified by the user.

cutpoint

A numeric value between 0 and 1 representing the quantile at which to split numeric features into binary nominal features. For example, 0.5 splits at the median. See Details for more information on discretization.

features_retained

A tibble containing the features that were retained by the FCBF algorithm. This parameter is only produced after the recipe has been trained and should not be specified by the user.

removals

A tibble containing the features that were removed by the FCBF algorithm. This parameter is only produced after the recipe has been trained and should not be specified by the user.

role

Not used for this step since new variables are not created.

trained

A logical to indicate if the quantities for preprocessing have been estimated.

skip

A logical. Should the step be skipped when the recipe is baked by bake.recipe()? While all operations are baked when prep.recipe() is run, some operations may not be able to be conducted on new data (e.g. processing the outcome variable(s)). Care should be taken when using skip = TRUE as it may affect the computations for subsequent operations.

id

A character string that is unique to this step to identify it.

Value

An updated version of `recipe` with the new step added to the sequence of existing steps (if any). For the `tidy` method, a tibble with a `terms` column listing the predictors that were removed.

Details

This function implements the fast correlation-based filter (FCBF) algorithm as described in Yu & Liu (2003). FCBF selects features that have high correlation to the outcome, and low correlation to other features.
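The two criteria above (relevance to the outcome, low redundancy among features) can be sketched in a few lines of base R. This is only an illustration of the selection loop described by Yu & Liu (2003), not the code this step runs. The names `fcbf_sketch`, `su_outcome`, and `su_pairs` are hypothetical, the correlation scores are made up, and the use of `>=` in both comparisons is an assumption.

```r
# Hypothetical helper for illustration; not part of any package API.
# su_outcome: named numeric vector, correlation of each feature with the outcome.
# su_pairs:   symmetric matrix of feature-to-feature correlations,
#             with row and column names equal to the feature names.
fcbf_sketch <- function(su_outcome, su_pairs, threshold = 0.025) {
  # Relevance: keep features that meet the threshold, strongest first.
  relevant <- su_outcome[su_outcome >= threshold]
  candidates <- names(sort(relevant, decreasing = TRUE))

  selected <- character(0)
  while (length(candidates) > 0) {
    best <- candidates[1]
    selected <- c(selected, best)
    rest <- candidates[-1]
    # Redundancy: drop a feature if it is at least as correlated with the
    # already-selected feature as it is with the outcome.
    redundant <- su_pairs[best, rest] >= su_outcome[rest]
    candidates <- rest[!redundant]
  }
  selected
}

# Made-up scores for four features.
su_outcome <- c(a = 0.60, b = 0.50, c = 0.30, d = 0.01)
feats <- names(su_outcome)
su_pairs <- matrix(0, 4, 4, dimnames = list(feats, feats))
su_pairs["a", "b"] <- su_pairs["b", "a"] <- 0.70
su_pairs["a", "c"] <- su_pairs["c", "a"] <- 0.10
su_pairs["b", "c"] <- su_pairs["c", "b"] <- 0.20
diag(su_pairs) <- 1

fcbf_sketch(su_outcome, su_pairs, threshold = 0.025)  # "a" "c"
fcbf_sketch(su_outcome, su_pairs, threshold = 0.005)  # "a" "c" "d"
```

In this toy input, `b` is dropped because it is more strongly correlated with `a` than with the outcome, and `d` only enters once the threshold is lowered.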

Symmetrical uncertainty (SU) is used to indicate the degree of correlation between predictors and the outcome. A threshold value for SU must be specified, and smaller threshold values will result in more features being selected by the algorithm. Appropriate thresholds are data-dependent, so different threshold values may need to be explored. It is not possible to specify an exact number of features that should be retained.
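For intuition, SU between two categorical variables is commonly defined as twice their mutual information divided by the sum of their entropies, which gives a value between 0 (independent) and 1 (one variable fully determines the other). The base R sketch below follows that definition. The function names `entropy` and `symmetrical_uncertainty` are hypothetical, and the FCBF package may compute the quantity differently in its details.

```r
# Shannon entropy (in bits) of a vector of probabilities.
entropy <- function(p) {
  p <- p[p > 0]  # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

# Symmetrical uncertainty between two categorical vectors of equal length.
symmetrical_uncertainty <- function(x, y) {
  joint <- table(x, y)
  joint <- joint / sum(joint)          # joint probabilities
  h_x <- entropy(rowSums(joint))       # H(X)
  h_y <- entropy(colSums(joint))       # H(Y)
  h_xy <- entropy(as.vector(joint))    # H(X, Y)
  if (h_x + h_y == 0) return(0)        # both variables are constant
  2 * (h_x + h_y - h_xy) / (h_x + h_y) # 2 * I(X; Y) / (H(X) + H(Y))
}

x <- c("a", "a", "b", "b")
symmetrical_uncertainty(x, x)                     # identical variables: 1
symmetrical_uncertainty(x, c("u", "v", "u", "v")) # independent variables: 0
```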

The algorithm requires categorical features, so continuous features are discretized using a binary split (at the median by default). Discretization is only used within the feature selection algorithm; selected features are retained in their original continuous form for further processing.
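A binary split at a quantile can be sketched in base R as follows. This is an assumption-laden illustration rather than the step's own code: the helper name `discretize_at`, the `"low"`/`"high"` labels, and the choice to place values equal to the split point in the lower group are all hypothetical.

```r
# Hypothetical helper: split a numeric vector into two groups at a quantile.
discretize_at <- function(x, cutpoint = 0.5) {
  split_value <- quantile(x, probs = cutpoint, na.rm = TRUE)
  # Values above the split point are "high"; the rest are "low" (assumed rule).
  factor(ifelse(x > split_value, "high", "low"), levels = c("low", "high"))
}

x <- c(1, 2, 3, 4, 10)
binned <- discretize_at(x)  # median is 3: low low low high high

# The binned copy would only feed the selection algorithm;
# the original numeric values in x are left unchanged.
data.frame(original = x, binned = binned)
```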

The FCBF algorithm is implemented by the Bioconductor package 'FCBF', which can be installed with BiocManager::install("FCBF").

References

Yu, L. and Liu, H. (2003). Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution. Proc. 20th Intl. Conf. Mach. Learn. (ICML-2003), Washington DC.

Examples