Wednesday 4 June 2014

Automated feature selection using random forest

Overview

Continuing on from the previous blog on random forest, I decided to investigate how to perform automated feature selection using feature importance measurements. This kind of process becomes important when the feature set is very large and the speed of the machine learning algorithm is a function of the number of features. Dimensionality reduction (feature set reduction) addresses this problem using algorithms such as principal component analysis. For now I chose to keep the problem simple, expanding my knowledge of the random forest algorithm while improving my R skills.

The implementation I present here is far from perfect, or even production level; for example, notice how I have omitted almost all of the function arguments to randomForest, never mind the R inefficiencies. However, the code is reasonable enough to demonstrate the basic idea of automated feature selection and some interesting R snippets for the novice R programmer. Source code is now available on Github.

Sequential feature selection algorithm

The algorithm reduces the number of features while searching for the minimum error rate, and is based on a two-step process.

  1. Feature selection is performed by taking each feature's MeanDecreaseGini value and determining whether it passes the incrementing threshold. If the value is greater than the threshold the feature remains in the feature set, otherwise it is removed from further processing (a minimal sketch follows this list).
  2. Using the new feature set, find the optimal number of trees with the lowest error rate.
On start up all features are used. On each pass of the feature selection loop the threshold is increased by one to help reduce the feature space.
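
To make step 1 concrete, here is a minimal sketch of the Gini threshold filter, assuming 'fit' is an already fitted randomForest model; the function and variable names are illustrative only and are not part of the implementation further down.

# Illustrative sketch of step 1: keep only the predictors whose
# MeanDecreaseGini importance passes the current threshold.
selectFeatures <- function( fit, threshold ){
  imp <- data.frame( fit$importance )
  rownames( imp )[ imp$MeanDecreaseGini > threshold ]
}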

The following sections present the code to perform learning and feature selection. I have reduced the code to the most interesting components; notice that I have introduced a user library and imported it using the 'source' function.

The program flow is as follows (code below):

  1. Load and prepare the data.
  2. Set the algorithm parameters.
  3. Execute the algorithm.
  4. Display various plots and model measurements.
  5. Export and save model to PMML.
The initial feature set consisted of sixteen variables, including the target feature.

# Random forest package
library(randomForest)

# Load the user lib files
source( 'lib/rforestHelperFunctions.R' )

# Load the data, partition to train/test sets etc.
creditapps <- read.table( "../data/creditcard_applications.txt", sep=",", header=FALSE )

etc.......

# Algorithm parameters
formula <- "V16 ~." 
attrThreshold <- 4
iterations <- 10
initTreeCount <- 55
treeCountStep <- 25

# Now call the algorithm
bestFittingTree <- sqBestFeatureModel.rf( formula, trainingSet, "V16", iterations, initTreeCount, treeCountStep, attrThreshold )

tree <- bestFittingTree$tree
errors <- bestFittingTree$errors

# Output error plot
plot( errors )

# Importance of each predictor and model diagnostics
print( bestFittingTree )
importance( tree )
varImpPlot( tree )

# Predictions on the held-out test set against the actual target values
plot( predict( object=tree, newdata=testingSet ), testingTarget )
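
Step 5 of the flow above is not shown in the listing. A possible version of the export, assuming the pmml and XML packages are installed (pmml can convert randomForest objects) and using a hypothetical output file name, might look like this:

# Export the winning forest to PMML (hypothetical file name)
library(pmml)
library(XML)
saveXML( pmml(tree), file="creditcard_rf_model.pmml" )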


Algorithm implementation

The algorithm is split into two sections: selecting the best features and finding the best number of trees for the given features. Below is a description of the required parameters.

Function parameters

  1. modelFormula - The R formula used for the randomForest function.
  2. data - The training data frame.
  3. targetVar - The name of the target (dependent) variable.
  4. maxIterations - Number of iterations to perform the search.
  5. numStartTrees - The initial number of trees to grow.
  6. treeStepCount - The tree growth increment for each step.
  7. attrThreshold - The initial MeanDecreaseGini threshold used to remove features.

sqBestFeatureModel.rf <- function( modelFormula, data, targetVar, maxIterations, numStartTrees, treeStepCount, attrThreshold ){

  # Start with every column of the training data as a candidate feature
  features <- colnames( data )
  errMeasure <- c()
  bestFit <- NULL
  bestFeatures <- features
  bestFeatureErr <- Inf

  # Loop, reducing the feature set on each pass
  for (n in 1:maxIterations){

    # Restrict the training data to the current feature set
    X <- data[ features ]

    # Find the optimal number of trees for this feature set
    bestTree <- optimalTreeSize.rf( modelFormula, X, maxIterations, numStartTrees, treeStepCount )

    # Mean per-class OOB error for this pass, tracked for the error plot
    featureErr <- sum( data.frame(bestTree$confusion)["class.error"] )/2
    errMeasure <- append( errMeasure, featureErr )

    # Save the best performing forest and feature set, then prune features
    if( featureErr < bestFeatureErr ){
      bestFit <- bestTree
      bestFeatures <- features
      bestFeatureErr <- featureErr

      # Keep only the predictors whose Gini importance passes the current threshold
      imp <- data.frame( bestFit$importance )
      features <- rownames( imp )[ imp$MeanDecreaseGini > attrThreshold ]
      #print( bestFit )
      #print( features )
    }

    # Add the target variable back in so the formula can still be fitted
    if( ! targetVar %in% features ){
      features <- append( features, targetVar )
    }

    # Increase the threshold for the next pass
    attrThreshold <- attrThreshold + 1
  }

  return( list(tree=bestFit, errors=c(errMeasure)) )
}


Optimal trees function

This function returns the forest with the optimal number of trees for the given feature set. This is done by growing N additional trees per iteration while keeping the best-performing forest.

Function parameters

  1. modelFormula - The R formula used for the randomForest function.
  2. trainingData - The training data frame.
  3. iterations - Number of iterations to perform the search.
  4. initNumTree - The initial number of trees to grow.
  5. treeStep - The tree growth increment for each step.

optimalTreeSize.rf <- function( modelFormula, trainingData, iterations, initNumTree, treeStep ){
  bestTree <- NULL
  bestErr <- Inf
  for(i in 1:iterations){
    # Train the RF using the selected set of features for N random trees
    fit <- randomForest( as.formula(modelFormula), data=trainingData, ntree=initNumTree )
    # Mean per-class OOB error for this forest
    currErr <- sum( data.frame(fit$confusion)["class.error"] )/2
    # Keep the forest with the lowest error seen so far
    if( currErr < bestErr ){
      bestTree <- fit
      bestErr <- currErr
    }
    # Grow a larger forest on the next pass
    initNumTree <- initNumTree + treeStep
  }
  return( bestTree )
}
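
As a quick usage note, the helper can also be called on its own; the call below simply reuses the driver parameters from earlier.

# Stand-alone use of the helper, mirroring the driver parameters above
bestForest <- optimalTreeSize.rf( "V16 ~.", trainingSet, iterations=10,
                                  initNumTree=55, treeStep=25 )
print( bestForest$ntree )   # number of trees in the best-performing forest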


Algorithm Results

This section demonstrates that the algorithm is working: it keeps the best model with the lowest error while reducing the feature set. As we can see from the textual and plot output, the final forest has the lowest error rate while reducing the number of features to nine. The result is significantly better than manually working through permutations of features.

> print(bestFittingTree)
$tree

Call:
 randomForest(formula = as.formula(modelFormula), data = trainingData, ntree = initNumTree) 
               Type of random forest: classification
                     Number of trees: 230
No. of variables tried at each split: 3

        OOB estimate of  error rate: 11.59%
Confusion matrix:
     NO YES class.error
NO  190  31   0.1402715
YES  17 176   0.0880829

$errors
 [1] 0.1222774 0.1141772 0.1238834 0.1216210 0.1193585 0.1264741 0.1245399 0.1238834 0.1290648 0.1271306

> importance( tree )
    MeanDecreaseGini
V2          14.73728
V3          13.69773
V6          20.97015
V8          20.19242
V9          71.86184
V10         12.66245
V11         23.24506
V14         12.78868
V15         13.48829
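
To complement the OOB estimate above with a hold-out check, a confusion matrix on the test set can be produced with table(); this assumes the testingSet and testingTarget objects created in the driver script.

# Hold-out check of the selected forest, to complement the OOB estimate
testPred <- predict( tree, newdata=testingSet )
print( table( predicted=testPred, actual=testingTarget ) )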


Summary

This algorithm improves on the previous blog's error rate by 29%, and while testing I saw this increase to 39% with an error rate approaching 10%. Given its simple approach it is rather effective, although before committing to this algorithm I would want to work with much larger datasets to determine its true ability to select features. Another algorithm I would like to explore is the Wiggle Algorithm (a made-up name), where you adjust the feature variables to determine their effect on the prediction output, although this is only applicable to purely numerical datasets; a rough sketch follows below.
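
As a rough sketch of that idea, and assuming V2 is a numeric feature, one could nudge a single variable and watch how the predicted class probabilities respond; 'tree' and 'testingSet' are the objects from the driver script above.

# "Wiggle" sketch: nudge one numeric feature by 5% and observe how the
# predicted class probabilities move (assumes V2 is numeric)
wiggled <- testingSet
wiggled$V2 <- wiggled$V2 * 1.05
probShift <- predict( tree, newdata=wiggled, type="prob" ) -
             predict( tree, newdata=testingSet, type="prob" )
summary( probShift )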

On an ending note, I can imagine there are a number of R inefficiencies, so please feel free to either enhance the code or send me the details.

Technologies

  1. Source code now available on Github
  2. RStudio
  3. Random Forest R package
  4. Good explanation of Random Forest