Generalized Birthday Problem: Rcpp

In the previous post, I tried to use some algorithms to speed up simulation. Now in this post, I will introduce a pretty powerful package Rcpp, which has the potential to fundementally change the way you do simulation if you want to employ a time-consuming simulation in R. Let’s finish this project!

Rcpp

Rcpp is basically a package that incorporates C++ code in R. For me, using Rcpp means my code can work faster while I can still enjoy the convenience of many versatile R packages. A simple way to use Rcpp is the function cppFunction. C++ code comes in as text in your R script. In this example, I define the function simulateC, which simulates and calculate the probility of a birthday shared by at least k persons in a room size of room (sorry for inconsistent notation, but I prefer i as index in a loop). I am totally not an expert in C++, so I am not goint through the code (this script is my first C++ script, to be honest). This is just to give a taste of how fast a function created by Rcpp can be.

library(Rcpp)
cppFunction("
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
float simulateC (int k, int room, int replicate) {
  IntegerVector result (replicate); 
  for (int r=0; r<replicate; r++){
    IntegerVector counts(365);
    for (int i=0; i<room; i++) {
      int pos = rand() % 365;
      if (pos < 365 && pos >= 0) counts[pos] ++;
    } 
    if (max(counts) >= k) {
      result[r] = TRUE;
    } else result[r] = FALSE;
  }
  float prob = float(sum(result))/float(replicate);
  return prob;
}
")

Let’s benchmark the time spent by either tabulate or Rcpp method. rbenchmark is a package that we can measure the computer time consumed.

library(rbenchmark)
k <- 2
room <- 23
r <- 10000
set.seed(1)
b <- benchmark(Rcpp = {
    simulateC(k = 2, room = room, replicate = r)
}, Tabulate = {
    birth <- matrix(sample(1:365, r * room, replace = TRUE), 
        nrow = r, ncol = room)
    sum(apply(birth, 1, function(x) max(tabulate(x)) >= k))/r
}, replications = 1000, columns = c("test", "elapsed", "replications", 
    "relative"))

b
##       test elapsed replications relative
## 1     Rcpp   7.643         1000    1.000
## 2 Tabulate  65.545         1000    8.576

Bascially, benchmark runs each methods for 1000 times, and compare the time consumed. It takes 7.643/1000 = 0.007643 seconds for a run of simulateC(k=2, room=23, replicate=1e4) and 0.065545 seconds for a run of tabulate method. So on average, the Rcpp method is 8.576 times faster than tabulate methods.

Generalized birthday problem

This is the finall verison of function in generalized birthday problem project. Simply replace the code for simulation of one room size by simulateC. Now we can try large room size and large replicates that are unbelievably time-consuming if we do it withou RCpp.

library(Rcpp)
gbp_rcpp <- function(k = 2, p = 0.5, B_l = 1000, B_u = 20000, 
    threshold = 0.001) {
    i <- k  # Tested room size; start from k persons in a room
    x <- 1  # Counter for while loop
    L <- 0  # Lower bound for binary search
    R <- k^4 + 100 + sample(-k:k, 1)  # Upper bound for binary search
    
    # Create an empty data frame for result
    result <- setNames(data.frame(matrix(0, 100, 3)), c("room", 
        "p", "replicate"))
    
    while (abs(result$room[x] - p) >= threshold) {
        # Increase replicate when approaching target room size
        r <- floor(-B_u/(1 + exp(-0.03 * ((R - L) - 200)))) + 
            (B_u + B_l)
        result$replicate[x] <- r
        
        # Test room size i
        i <- ceiling((L + R)/2)
        result$room[x] <- i
        
        # Calculate probability by running simulation in C++
        result$p[x] <- simulateC(k = k, room = i, replicate = r)
        if (result$p[x] >= p) 
            R <- i
        if (result$p[x] < p) 
            L <- i
        
        # Break the loop if trapped
        if (sum(duplicated(tail(result$room[1:x], 2))) == 1 & 
            result$room[x] != 0) 
            break
        
        # Print out room sizes tested
        cat(paste0("Room size tested:", i, "\tprobability:", 
            round(result$p[x], 3), "\treplicate size: ", r, "\n"))
        
        # Counter for steps
        x <- x + 1
    }
    ans <- R
    return(list(answer = ans, result = result[1:x, ]))
}

Let’s try test the ideal room size for k=8 and p=0.5. When approaching the ideal room size, increase the replicate size up to one million.

set.seed(1)
system.time(y_rcpp <- gbp_rcpp(k = 8, p = 0.5, B_u = 1e+06))
## Room size tested:2096    probability:1   replicate size: 1000
## Room size tested:1048    probability:0.964   replicate size: 1000
## Room size tested:524 probability:0.037   replicate size: 1000
## Room size tested:786 probability:0.49    replicate size: 1060
## Room size tested:917 probability:0.804   replicate size: 135703
## Room size tested:852 probability:0.646   replicate size: 888952
## Room size tested:819 probability:0.557   replicate size: 983363
## Room size tested:803 probability:0.514   replicate size: 994373
## Room size tested:795 probability:0.492   replicate size: 996889
## Room size tested:799 probability:0.503   replicate size: 997858
## Room size tested:797 probability:0.497   replicate size: 998213
## Room size tested:798 probability:0.5 replicate size: 998374
##    user  system elapsed 
##  47.737   0.537  57.407
y_rcpp
## $answer
## [1] 798
## 
## $result
##    room         p replicate
## 1  2096 1.0000000      1000
## 2  1048 0.9640000      1000
## 3   524 0.0370000      1000
## 4   786 0.4896227      1060
## 5   917 0.8037553    135703
## 6   852 0.6463184    888952
## 7   819 0.5572042    983363
## 8   803 0.5140676    994373
## 9   795 0.4922083    996889
## 10  799 0.5027779    997858
## 11  797 0.4972315    998213
## 12  798 0.5001422    998374
## 13  798 0.5006149    998452

Now we have confident to say the ideal room size for k=8 is 798.

GeneralBirthdayProblem package

In case you want to play with generalized birthday problem and the source code, or don’t want to write from the very beginning. I uploaded the package to my github. You can simply dowload the R package by the following command in your R interface.

devtools::install_github("Chang-Yu-Chang/GeneralBirthdayProblem")

This package only has two functions: simulateC and gbp. These two functions are basically the same as what I used throughout this post. simulateC simulates and reports the probability given k, room size, and the replicate size.

simulateC(k = 2, room = 23, replicate = 1e+05)
## [1] 0.50724

gbp incorporates the simulateC function and finds the ideal room size given k and p. I set the printing flexible so that if you don’t want the result to be printed every time.

gbp(k = 2, p = 0.5)  # Default is print=FALSE
## [1] 23

Related

comments powered by Disqus