Generalized birthday problem Rcpp

In the previous post, I tried to use some algorithms to speed up simulation. Now in this post, I will introduce a pretty powerful package Rcpp, which has the potential to fundementally change the way you do simulation if you want to employ a time-consuming simulation in R. Let’s finish this project!

Rcpp

Rcpp is basically a package that incorporates C++ code in R. For me, using Rcpp means my code can work faster while I can still enjoy the convenience of many versatile R packages. A simple way to use Rcpp is the function cppFunction. C++ code comes in as text in your R script. In this example, I define the function simulateC, which simulates and calculate the probility of a birthday shared by at least k persons in a room size of room (sorry for inconsistent notation, but I prefer i as index in a loop). I am totally not an expert in C++, so I am not goint through the code (this script is my first C++ script, to be honest). This is just to give a taste of how fast a function created by Rcpp can be.

library(Rcpp)
cppFunction('
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
float simulateC (int k, int room, int replicate) {
  IntegerVector result (replicate); 
  for (int r=0; r<replicate; r++){
    IntegerVector counts(365);
    for (int i=0; i<room; i++) {
      int pos = rand() % 365;
      if (pos < 365 && pos >= 0) counts[pos] ++;
    } 
    if (max(counts) >= k) {
      result[r] = TRUE;
    } else result[r] = FALSE;
  }
  float prob = float(sum(result))/float(replicate);
  return prob;
}
')

Let’s benchmark the time spent by either tabulate or Rcpp method. rbenchmark is a package that we can measure the computer time consumed.

library(rbenchmark)
k <- 2
room <- 23
r <- 1e4
set.seed(1)
b <- benchmark("Rcpp" = {simulateC(k=2, room=room, replicate=r)},
               'Tabulate' = {
                 birth <- matrix(sample(1:365, r*room, replace=TRUE), nrow=r, ncol=room)
                 sum(apply(birth, 1, function(x) max(tabulate(x))>=k)) /r},
               replications = 1000,
               columns=c('test', 'elapsed', 'replications', 'relative')
)

b 

Bascially, benchmark runs each methods for 1000 times, and compare the time consumed. It takes 7.643/1000 = 0.007643 seconds for a run of simulateC(k=2, room=23, replicate=1e4) and 0.065545 seconds for a run of tabulate method. So on average, the Rcpp method is 8.576 times faster than tabulate methods.

Generalized birthday problem

This is the finall verison of function in generalized birthday problem project. Simply replace the code for simulation of one room size by simulateC. Now we can try large room size and large replicates that are unbelievably time-consuming if we do it withou RCpp.

library(Rcpp)
gbp_rcpp <- function(k=2, p=.5, B_l=1000, B_u=20000, threshold=.001) {
  i <- k                       # Tested room size; start from k persons in a room
  x <- 1                       # Counter for while loop
  L <- 0                       # Lower bound for binary search
  R <- k^4+100+sample(-k:k, 1) # Upper bound for binary search
  
  # Create an empty data frame for result
  result <- setNames(data.frame(matrix(0, 100, 3)), c("room", "p", "replicate"))
  
  while (abs(result$room[x] - p) >= threshold) {
    # Increase replicate when approaching target room size
    r <- floor(-B_u/(1+exp(-.03* ((R-L)-200))))+(B_u+B_l)
    result$replicate[x] <- r
    
    # Test room size i
    i <- ceiling((L+R)/2)
    result$room[x] <- i 

    # Calculate probability by running simulation in C++
    result$p[x] <- simulateC(k=k, room=i, replicate = r)
    if (result$p[x] >= p) R <- i
    if (result$p[x] < p) L <- i
    
    # Break the loop if trapped
    if(sum(duplicated(tail(result$room[1:x], 2))) == 1 & result$room[x] != 0) break
    
    # Print out room sizes tested
    cat(paste0('Room size tested:', i, '\tprobability:', round(result$p[x], 3), '\treplicate size: ', r, '\n'))
    
    # Counter for steps
    x <- x + 1          
 }
  ans <- R
  return(list(answer=ans,
              result=result[1:x,]))
}

Let’s try test the ideal room size for k=8 and p=0.5. When approaching the ideal room size, increase the replicate size up to one million.

set.seed(1)
system.time(y_rcpp <- gbp_rcpp(k=8, p=.5, B_u=1000000))
y_rcpp

Now we have confident to say the ideal room size for k=8 is 798.

GeneralBirthdayProblem package

In case you want to play with generalized birthday problem and the source code, or don’t want to write from the very beginning. I uploaded the package to my github. You can simply dowload the R package by the following command in your R interface.

devtools::install_github("Chang-Yu-Chang/GeneralBirthdayProblem")

This package only has two functions: simulateC and gbp. These two functions are basically the same as what I used throughout this post. simulateC simulates and reports the probability given k, room size, and the replicate size.

simulateC(k=2, room=23, replicate=1e5)

gbp incorporates the simulateC function and finds the ideal room size given k and p. I set the printing flexible so that if you don’t want the result to be printed every time.

gbp(k=2, p=0.5) # Default is print=FALSE
Chang-Yu Chang
Chang-Yu Chang
PhD candidate in Ecology and Evolutionary Biology