R Tip: Break up Function Nesting for Legibility


There are a number of easy ways to avoid illegible code nesting problems in R.

In this R tip we will expand upon the above statement with a simple example.

At some point it becomes illegible and undesirable to compose operations by nesting them, such as in the following code.

   head(mtcars[with(mtcars, cyl == 8), c("mpg", "cyl", "wt")])

#                     mpg cyl   wt
# Hornet Sportabout  18.7   8 3.44
# Duster 360         14.3   8 3.57
# Merc 450SE         16.4   8 4.07
# Merc 450SL         17.3   8 3.73
# Merc 450SLC        15.2   8 3.78
# Cadillac Fleetwood 10.4   8 5.25

One popular way to break up nesting is to use magrittr’s “%>%” in combination with dplyr transform verbs, as we show below.

library("dplyr")

mtcars                 %>%
  filter(cyl == 8)     %>%
  select(mpg, cyl, wt) %>%
  head

#    mpg cyl   wt
# 1 18.7   8 3.44
# 2 14.3   8 3.57
# 3 16.4   8 4.07
# 4 17.3   8 3.73
# 5 15.2   8 3.78
# 6 10.4   8 5.25

Note: the above code lost (without warning) the row names that are part of mtcars. We also pass over the details of how pipe notation works. It is sufficient to say the notational convention is: each stage is approximately treated as an altered function call with a new inserted first argument set to the value of the pipeline up to the current point.
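As a rough illustration of that convention (a sketch, not a description of magrittr's internals), the piped and nested forms below should compute the same thing. It assumes the magrittr package is installed, and it uses base R's subset() and head() so the row names are kept. The names "piped" and "nested" are ours.

```r
library("magrittr")

# Piped form: each stage receives the value so far as its first argument.
piped <- mtcars %>%
  subset(cyl == 8) %>%
  head(3)

# Roughly what the pipeline stands for, written as nested calls.
nested <- head(subset(mtcars, cyl == 8), 3)

# Check that the two forms agree, row names included.
print(identical(piped, nested))
```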

Many R users already routinely avoid nested notation problems through a convention I call “name re-use.” Such code looks like the following.

result <- mtcars
result <- filter(result, cyl == 8)
result <- select(result, mpg, cyl, wt)
head(result)

The above convention is enough to get around all problems of nesting. It also has the great advantage that it is step-debuggable. I recommend introducing and re-using a result name (in this case “result”), and not re-using the starting data name (in this case “mtcars”). This extra care makes the entire block restartable, which is another benefit when developing and debugging.
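To make the restartability point concrete, here is a small sketch using base R only. The unit-conversion step and the name "cars" are hypothetical; we add them because a filter or a column selection is harmless to repeat, while a rescaling is not.

```r
# Not restartable: the starting data itself is overwritten in place.
cars <- mtcars               # hypothetical working copy standing in for the input
cars$wt <- cars$wt * 1000    # first run: wt (in 1000 lbs) converted to lbs
cars$wt <- cars$wt * 1000    # accidental re-run of the same line: scaled twice

# Restartable: the block always begins again from the untouched input.
result <- mtcars
result <- subset(result, cyl == 8)
result$wt <- result$wt * 1000
first_run <- result

# Re-running the whole block from its first line gives the same answer.
result <- mtcars
result <- subset(result, cyl == 8)
result$wt <- result$wt * 1000
print(identical(first_run, result))
```

Re-running from the first line is safe only because the first line resets "result" from data the block never modifies.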

I like a variation I call “dot intermediates”, which looks like the code below (notice we are switching back from dplyr verbs to base R operators).

. <- mtcars
. <- subset(., cyl == 8)
. <- .[, c("mpg", "cyl", "wt")]
result <- .
head(result)

#                     mpg cyl   wt
# Hornet Sportabout  18.7   8 3.44
# Duster 360         14.3   8 3.57
# Merc 450SE         16.4   8 4.07
# Merc 450SL         17.3   8 3.73
# Merc 450SLC        15.2   8 3.78
# Cadillac Fleetwood 10.4   8 5.25

The dot intermediate convention is very succinct, and we can use it with base R transforms to get a correct (and performant) result. The dot intermediates convention is particularly neat when you don’t intend to take the result further into your calculation (such as when you only want to print it) as it does not require you to think up an evocative result name. Like all conventions: it is just a matter of teaching, learning, and repetition to make this seem natural, familiar and legible.

Also, contrary to what many repeat, base R is often faster than the dplyr alternative.

library("dplyr")
library("microbenchmark")
library("ggplot2")

timings <- microbenchmark(
  base = {
    . <- mtcars
    . <- subset(., cyl == 8)
    . <- .[, c("mpg", "cyl", "wt")]
    nrow(.)
  },
  dplyr = {
    mtcars                 %>%
      filter(cyl == 8)     %>%
      select(mpg, cyl, wt) %>%
      nrow
  })

print(timings)

## Unit: microseconds
##   expr      min       lq      mean   median       uq       max neval
##   base  122.948  136.948  167.2253  159.688  179.924   349.328   100
##  dplyr 1570.188 1654.700 2537.2912 1699.744 1785.611 50759.770   100

autoplot(timings)


Figure: durations for related tasks, smaller is better.
In this case base R is about 15 times faster by mean time and about 10 times faster by median time (possibly due to magrittr overhead and the small size of this example). We also see that, with some care, base R can be quite legible. dplyr is a useful tool and convention; however, it is not the only allowed tool or the only allowed convention.
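One way to probe the two possible causes mentioned above is to repeat the timing on a larger input and to time the dplyr verbs both piped and nested. The sketch below assumes dplyr and microbenchmark are installed. The stacked data frame "big", the expression labels, and the choice of 20 repetitions are our own illustrative assumptions, and we show no timings because they will vary by machine and package version.

```r
library("dplyr")
library("microbenchmark")

# Hypothetical larger input: mtcars stacked 1000 times (32,000 rows).
big <- mtcars[rep(seq_len(nrow(mtcars)), 1000), ]
rownames(big) <- NULL

timings_big <- microbenchmark(
  base = {
    . <- big
    . <- subset(., cyl == 8)
    . <- .[, c("mpg", "cyl", "wt")]
    nrow(.)
  },
  dplyr_piped = {
    big                    %>%
      filter(cyl == 8)     %>%
      select(mpg, cyl, wt) %>%
      nrow
  },
  # Same dplyr verbs without the pipe, to separate out any pipe overhead.
  dplyr_nested = nrow(select(filter(big, cyl == 8), mpg, cyl, wt)),
  times = 20L
)

print(timings_big)
```

If the gap narrows on the larger input, fixed per-call overhead is a likely explanation. If the piped and nested dplyr timings are close, the pipe itself is probably not the main cost.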

 


 
