sl3: Machine Learning Pipelines for R

Simplifying machine learning in R through pipelines
R data science machine learning computing

This post is a working draft and is published only to share with coauthors.

Common in the language of modern data science are words such as “munging,” “massaging,” “mining” – all words denoting the interactive process by which the analyst extracts some form of deliverable inference from a given data set. These terms express, among other things, the (often) convoluted process by which a set of pre-processing and estimation procedures are applied to an input data set in order to transform said data set into a “tidy” output data set from which informative visualizations and summaries may be easily extracted. A formalism that captures this involved process is that of machine learning pipelines. A pipeline, popularized by the method of the same name in Python’s scikit-learn library, may be thought of as a simple bundle that documents procedures to be applied to an input data set in a particular order, ultimately resulting in a tidy output data set.

Recently, the pipeline idiom has made its way into the R programming language, via the new sl3 R package. A concrete understanding of the utility of pipelines is best developed by example – so, that’s precisely what we’ll do! In the following, we’ll apply the concept of a machine learning pipeline to the canonical iris data set, combining a series of learners (machine learning algorithms for estimation/classification) with principal components analysis (PCA), a simple pre-processing step.

library(datasets)
library(tidyverse)
library(data.table)
library(caret)
library(sl3)
set.seed(352)

data(iris)
iris <- iris %>%
  as_tibble(.)
iris
## # A tibble: 150 x 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl>  <fctr>
##  1          5.1         3.5          1.4         0.2  setosa
##  2          4.9         3.0          1.4         0.2  setosa
##  3          4.7         3.2          1.3         0.2  setosa
##  4          4.6         3.1          1.5         0.2  setosa
##  5          5.0         3.6          1.4         0.2  setosa
##  6          5.4         3.9          1.7         0.4  setosa
##  7          4.6         3.4          1.4         0.3  setosa
##  8          5.0         3.4          1.5         0.2  setosa
##  9          4.4         2.9          1.4         0.2  setosa
## 10          4.9         3.1          1.5         0.1  setosa
## # ... with 140 more rows

To create very simple training and testing splits, we’ll rely on the popular caret R package:

trn_indx <- createDataPartition(iris$Species, p = .8, list = FALSE,
                                times = 1) %>%
  as.numeric()
tst_indx <- which(!(seq_len(nrow(iris)) %in% trn_indx))

Now that we have our training and testing splits, we can organize the data into tasks – the central bookkeeping object in the sl3 framework. Essentially, tasks represent a, well, data analytic task that is to be solved by invoking the various machine learning algorithms made available by sl3.

# a task with the data from the training split
iris_task_train <- sl3_Task$new(
  data = iris[trn_indx, ],
  covariates = colnames(iris)[-5],
  outcome = colnames(iris)[5],
  outcome_type = "categorical"
)
iris_task_train
## A sl3 Task with 120 obs and these nodes:
## $covariates
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" 
## 
## $outcome
## [1] "Species"
## 
## $id
## NULL
## 
## $weights
## NULL
## 
## $offset
## NULL
# a task with the data from the testing split
iris_task_test <- sl3_Task$new(
  data = iris[tst_indx, ],
  covariates = colnames(iris)[-5],
  outcome = colnames(iris)[5],
  outcome_type = "categorical"
)
iris_task_test
## A sl3 Task with 30 obs and these nodes:
## $covariates
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" 
## 
## $outcome
## [1] "Species"
## 
## $id
## NULL
## 
## $weights
## NULL
## 
## $offset
## NULL

Having set up the data properly, let’s proceed to design pipelines that we can rely on for processing and analyzing the data. A pipeline simply represents a set of machine learning procedures to be invoked sequentially, with the results derived from earlier algorithms in the pipeline being used to train those later in the pipeline. Thus, a pipeline is a closed end-to-end system for resolving the problem posed by an sl3 task.

We’ll rely on PCA for dimension reduction, gathering only the two most important principal component dimensions to use in training our classification models. Since this is a quick experiment with a well-studied data set, we’ll use just two classification procedures: (1) Logistic regression with regularization (e.g., the LASSO) and (2) Random Forests.

pca_h2o <- Lrnr_h2o_mutator$new(algorithm = "pca", k = 2, impute_missing = TRUE)
glmnet_learner <- Lrnr_glmnet$new()
rf_learner <- Lrnr_randomForest$new()

Above, we merely instantiate the learners by invoking the $new() method of each of the appropriate objects. We now have a PCA method that generates and extracts just the first two principal components derived from the design matrix.
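Note that $new() also accepts tuning parameters, which are stored with the learner and passed along to the underlying implementation at training time. As a minimal sketch, something like the following should work – the argument names here (alpha, nfolds, ntree) are those of the underlying glmnet and randomForest packages, so consult their documentation to confirm:

# a LASSO fit (alpha = 1) with 5-fold cross-validation for lambda selection;
# argument names follow the underlying glmnet package
lasso_learner <- Lrnr_glmnet$new(alpha = 1, nfolds = 5)

# a random forest with a larger ensemble; ntree follows the randomForest package
rf_learner_big <- Lrnr_randomForest$new(ntree = 1000)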

Aside on H2O: Here, we rely on the implementation of PCA by H2O.ai; this implementation takes advantage of the H2O data science platform and is available for R via the h2o package. Since we’ll be using H2O, we need to set up an appropriate cluster on our local machine with

library(h2o)
h2o.init()
## 
## H2O is not running yet, starting it now...
## 
## Note:  In case of errors look at the following log files:
##     /var/folders/sr/8wdg8m6s5pv211sp22lr5dlw0000gn/T//RtmpNzp5NH/h2o_nimahejazi_started_from_r.out
##     /var/folders/sr/8wdg8m6s5pv211sp22lr5dlw0000gn/T//RtmpNzp5NH/h2o_nimahejazi_started_from_r.err
## 
## 
## Starting H2O JVM and connecting: ... Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         3 seconds 54 milliseconds 
##     H2O cluster version:        3.16.0.2 
##     H2O cluster version age:    1 month and 2 days  
##     H2O cluster name:           H2O_started_from_R_nimahejazi_efh185 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   1.78 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         XGBoost, Algos, AutoML, Core V3, Core V4 
##     R Version:                  R version 3.4.3 (2017-11-30)

Other than our PCA learner, we’ve also instantiated a regularized regression model (glmnet_learner above), based on the implementation available in the glmnet R package, as well as a random forest model based on the canonical implementation available in the randomForest R package.

Now that our individual learners are set up, we can intuitively string them into pipelines like so

pca_to_glmnet <- Pipeline$new(pca_h2o, glmnet_learner)
pca_to_rf <- Pipeline$new(pca_h2o, rf_learner)

The first pipeline above merely invokes our PCA learner, extracting the first two principal components of the design matrix from the input task and passing these as inputs to the logistic regression model. Similarly, the second pipeline invokes PCA and passes the results to our random forest model.
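Since a pipeline is itself just another sl3 learner object, it can also be trained and used for prediction on its own, without being placed in a stack. A minimal sketch, using the tasks constructed above, might look like this:

# train the PCA-then-glmnet pipeline on the training task
pipe_fit <- pca_to_glmnet$train(iris_task_train)

# predicted class probabilities for the held-out observations
pipe_preds <- pipe_fit$predict(iris_task_test)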

To streamline the training of our pipelines, we’ll bundle them into a single stack, then train the model stack all at once. Similar in spirit to a pipeline, a stack is a bundle of sl3 learner objects that are to be trained together. The principal difference is that learners in a pipeline are trained sequentially, as described above, while those in a stack are trained in parallel (not in the computational sense, though we can, of course, speed up the fitting procedure with parallelization). Thus, the models in a stack are trained independently of one another.

Now, let’s go ahead and generate a stack and train the two pipelines on our training split of the iris data set:

model_stack <- Stack$new(pca_to_glmnet, pca_to_rf)
fit_model_stack <- model_stack$train(iris_task_train)
out_model_stack <- fit_model_stack$predict(iris_task_test)
pipe1_preds <- as.data.table(matrix(unlist(out_model_stack[[1]]), ncol = 3))
pipe2_preds <- as.data.table(matrix(unlist(out_model_stack[[2]]), ncol = 3))

After extracting the predicted probabilities of each observation being in a given class (the iris species), we now clean up the results a bit to make them more reportable:

outcome_names <- c("setosa", "versicolor", "virginica")
setnames(pipe1_preds, outcome_names)
setnames(pipe2_preds, outcome_names)

# get class predictions
(pipe1_classes <- outcome_names[apply(pipe1_preds, 1, which.max)])
##  [1] "setosa"     "versicolor" "virginica"  "setosa"     "versicolor"
##  [6] "virginica"  "setosa"     "versicolor" "virginica"  "setosa"    
## [11] "versicolor" "virginica"  "setosa"     "versicolor" "virginica" 
## [16] "setosa"     "versicolor" "virginica"  "setosa"     "versicolor"
## [21] "virginica"  "setosa"     "versicolor" "virginica"  "setosa"    
## [26] "versicolor" "virginica"  "setosa"     "versicolor" "virginica"
(pipe2_classes <- outcome_names[apply(pipe2_preds, 1, which.max)])
##  [1] "setosa"     "versicolor" "virginica"  "setosa"     "versicolor"
##  [6] "virginica"  "setosa"     "versicolor" "virginica"  "setosa"    
## [11] "versicolor" "virginica"  "setosa"     "versicolor" "virginica" 
## [16] "setosa"     "versicolor" "virginica"  "setosa"     "versicolor"
## [21] "virginica"  "setosa"     "versicolor" "virginica"  "setosa"    
## [26] "versicolor" "virginica"  "setosa"     "versicolor" "virginica"
