Generalized Birthday Problem

This tiny project comes from the coding class I am taking this semester. The simplest birthday problem is as followed: how many people should be in a room such that the probability of a birthday shared by at least two persons is just above 0.5? I found this problem is very interesting. Throughout the process dealing this problem, some powerful R packages and essential coding skills are incorporated for more efficient simulation. In this project, I will try to address generalized birthday problem in an intuitive way (easy coding but computer expensive and slow) to somehow more advanced approaches.

What’s probability of one birthday shared by two persons in any size of room?

The probability we are interested calculated such that a birthday shared by at least k persons. The targeted probability p here is 0.5. For each room size i, r replicates are tested.

In the following case, I did only 20 replicates.

k <- 2                # Birthday shared by k persons
p <- 0.5              # Targetd probability
r <- 20               # Replicates
result <- rep(0, 100) # A vector for saving result 
i <- 10                  # Room size

Now here is trick. For each room size, create a single matrix that has r rows and i columns. The next step is to check by row: if there are at least k=2 persons in one room (i=10) sharing the same birthday, return TRUE. Use apply to implicitly loop through each row. The result probability would be the sum of this vector divided by total replicate r.

set.seed(12345)
birth <- matrix(sample(1:365, r*i, replace=TRUE), nrow=r, ncol=i)
birth
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]  264  166  286  289  318  108  205  191  271    62
##  [2,]  320  120  157   95  188  226  188  254  218   150
##  [3,]  278  353  339  360    4  356   33   79  311    72
##  [4,]  324  259  283  277    8  226  250  230  298    95
##  [5,]  167  236   95  358   53  191  260   54  139   355
##  [6,]   61  143  118   80  112  330  293  351  180   297
##  [7,]  119  255   22  347  302  233  345  362  363   127
##  [8,]  186  199   16   55  184  316   69  182   64   343
##  [9,]  266   83   21  220  294   92  102  336  305    60
## [10,]  362  177  229  346   23   79  283  136  145    58
## [11,]   13  290  353  252  339  223   13  302  354   276
## [12,]   56    3  302  185  295  140  246  205   69   324
## [13,]  269   69  115  137   29  276  297  291  274   112
## [14,]    1  249   78  123  220  139  243  129  113   114
## [15,]  143  136  268   18  261  291  345   77  139   360
## [16,]  169  132  183  226  188  331  320  287  275     3
## [17,]  142  318  267  351  263  360  255  250  258   101
## [18,]  147  331   30  240  274  215   68  158  337   292
## [19,]   66  226  159  187   35    4   36  309  291    89
## [20,]  348   49   87   55  146  118  295  220  303   168

The probability for 2 persons sharing the same birthday in a room with 10 persones is roughly 0.1.

sum(apply(birth, 1, function(x) max(tabulate(x))>=k )) / r
## [1] 0.1

Now test different room sizes. Say I wanna test room sizes ranging from k to 100 (testing room size smaller than k is nonsense). Use for loop to go through room size from k to 100.

k <- 2                # Birthday shared by k persons
p <- 0.5              # Targetd probability
r <- 100000           # Replicates
result <- rep(0, 100) # A vector for saving result 

system.time({         # Record time used
for (i in 1:100) {    # Test room sizes from 1 to 100 persons
  birth <- matrix(sample(1:365, r*i, replace=TRUE), nrow=r, ncol=i)
  result[i] <- sum(apply(birth, 1, function(x) max(tabulate(x))>=k )) / r
}
})
##    user  system elapsed 
##  55.684   4.624  60.952

This simulation is extremely slow. Something is pretty obvious, for example, if I already know the result is going to be 23 just after calculating it, I don’t need to do the else (24-100), which consumes tons of time. For the rest of this post and maybe the other post, I will introduce some ways to speed up the simulation or make it more accurracy.

plot(result, type="l", xlab="room size", ylab="probability", main="probability of a birthday shared by k persons")
abline(h=p, col="red") 

The answer is 23 persons.

min(which(result>p))
## [1] 23

Replicate size

Same logic can be applied to replicate size. We only need to increase replicates size when close to the target probability. If we know there are still hundreds of rooms size between current roome size and ideal room size, simulating a room size for millions times seems to be bad idea. As we are getting closer to the ideal room size, accuracy becomes more and more important. Accuracy may not be a big deal when k is 2 or 3, since the ideal room size is small. However, for larger k, for example, 8, then the efficiency become important for the first few steps.

To gradually increase replicate size, a sigmoidal curve might be a good idea.

B_l <- 1000
B_u <- 20000
x <- 1:500
y <- floor(-B_u/(1+exp(-.03* (x-200))))+(B_u+B_l)
plot(x, y, type='l', xlab='distance to ideal room size', ylab='replicate size')

When we are approximately 400 persons away the ideal room size, just do 1000 replicates. Whereas almost being to the goal, try 20000 replicates. Then the function can be modified as followed.

gbp <- function(k=2, p=.5, B_l=1000, B_u=20000, threshold=.001) {
  i <- k                       # Tested room size; start from k persons in a room
  x <- 1                       # Counter for while loop
  L <- 0                       # Lower bound for binary search
  R <- k^4+100+sample(-k:k, 1) # Upper bound for binary search
  
  # Create an empty data frame for result
  result <- setNames(data.frame(matrix(0, 100, 3)), c("room", "p", "replicate"))
  
  while (abs(result$room[x] - p) >= threshold) {
    # Increase replicate when approaching target room size
    r <- floor(-B_u/(1+exp(-.03* ((R-L)-200))))+(B_u+B_l)
    result$replicate[x] <- r
    
    # Test room size i
    i <- ceiling((L+R)/2)
    result$room[x] <- i 

    # Calculate probability
    birth <- matrix(sample(1:365, r*i, replace=TRUE), nrow=r, ncol=i)
    result$p[x] <- sum(apply(birth, 1, function(x) max(tabulate(x))>=k )) / r
    if (result$p[x] >= p) R <- i
    if (result$p[x] < p) L <- i
    
    # Break the loop if trapped
    if(sum(duplicated(tail(result$room[1:x], 2))) == 1 & result$room[x] != 0) break
    
    # Print out room sizes tested
    cat(paste0('Room size tested:', i, '\tprobability:', round(result$p[x], 3), '\treplicate size: ', r, '\n'))
    
    # Counter for steps
    x <- x + 1          
 }
  ans <- R
  return(list(answer=ans,
              result=result[1:x,]))
}

system.time(y<-gbp(k=8, p=.5))
## Room size tested:2098    probability:1   replicate size: 1000
## Room size tested:1049    probability:0.97    replicate size: 1000
## Room size tested:525 probability:0.049   replicate size: 1000
## Room size tested:787 probability:0.49    replicate size: 1001
## Room size tested:918 probability:0.81    replicate size: 3694
## Room size tested:853 probability:0.643   replicate size: 18759
## Room size tested:820 probability:0.562   replicate size: 20647
## Room size tested:804 probability:0.515   replicate size: 20867
## Room size tested:796 probability:0.497   replicate size: 20917
## Room size tested:800 probability:0.505   replicate size: 20937
## Room size tested:798 probability:0.501   replicate size: 20944
## Room size tested:797 probability:0.497   replicate size: 20947
## Room size tested:798 probability:0.5 replicate size: 20949
##    user  system elapsed 
##   6.357   1.024   7.704
y
## $answer
## [1] 23
## 
## $result
##   room         p replicate
## 1  116 0.0000000         0
## 2   58 0.9880000      1000
## 3   29 0.6809149     20985
## 4   15 0.2538821     20994
## 5   22 0.4782816     20996
## 6   26 0.5960183     20996
## 7   24 0.5389818     20997
## 8   23 0.5041196     20997
## 9   23 0.5096442         0

I will continue the rest of this project in another post.

Related

comments powered by Disqus