Generalized birthday problem

This project comes from the coding class I am taking this semester. The simplest birthday problem is as followed: how many people should be in a room such that the probability of a birthday shared by at least two persons is just above 0.5? I found this problem is very interesting. Throughout the process dealing this problem, some powerful R packages and essential coding skills are incorporated for more efficient simulation. In this project, I will try to address generalized birthday problem in an intuitive way (easy coding but computer expensive and slow) to somehow more advanced approaches.

What’s probability of one birthday shared by two persons in any size of room?

The probability we are interested calculated such that a birthday shared by at least k persons. The targeted probability p here is 0.5. For each room size i, r replicates are tested.

In the following case, I did only 20 replicates.

k <- 2                # Birthday shared by k persons
p <- 0.5              # Targetd probability
r <- 20               # Replicates
result <- rep(0, 100) # A vector for saving result 
i <- 10                  # Room size

Now here is trick. For each room size, create a single matrix that has r rows and i columns. The next step is to check by row: if there are at least k=2 persons in one room (i=10) sharing the same birthday, return TRUE. Use apply to implicitly loop through each row. The result probability would be the sum of this vector divided by total replicate r.

set.seed(12345)
birth <- matrix(sample(1:365, r*i, replace=TRUE), nrow=r, ncol=i)
birth

The probability for 2 persons sharing the same birthday in a room with 10 persones is roughly 0.1.

sum(apply(birth, 1, function(x) max(tabulate(x))>=k )) / r

Now test different room sizes. Say I wanna test room sizes ranging from k to 100 (testing room size smaller than k is nonsense). Use for loop to go through room size from k to 100.

k <- 2                # Birthday shared by k persons
p <- 0.5              # Targetd probability
r <- 100000           # Replicates
result <- rep(0, 100) # A vector for saving result 

system.time({         # Record time used
for (i in 1:100) {    # Test room sizes from 1 to 100 persons
  birth <- matrix(sample(1:365, r*i, replace=TRUE), nrow=r, ncol=i)
  result[i] <- sum(apply(birth, 1, function(x) max(tabulate(x))>=k )) / r
}
})

This simulation is extremely slow. Something is pretty obvious, for example, if I already know the result is going to be 23 just after calculating it, I don’t need to do the else (24-100), which consumes tons of time. For the rest of this post and maybe the other post, I will introduce some ways to speed up the simulation or make it more accurracy.

plot(result, type="l", xlab="room size", ylab="probability", main="probability of a birthday shared by k persons")
abline(h=p, col="red") 

The answer is 23 persons.

min(which(result>p))

Replicate size

Same logic can be applied to replicate size. We only need to increase replicates size when close to the target probability. If we know there are still hundreds of rooms size between current roome size and ideal room size, simulating a room size for millions times seems to be bad idea. As we are getting closer to the ideal room size, accuracy becomes more and more important. Accuracy may not be a big deal when k is 2 or 3, since the ideal room size is small. However, for larger k, for example, 8, then the efficiency become important for the first few steps.

To gradually increase replicate size, a sigmoidal curve might be a good idea.

B_l <- 1000
B_u <- 20000
x <- 1:500
y <- floor(-B_u/(1+exp(-.03* (x-200))))+(B_u+B_l)
plot(x, y, type='l', xlab='distance to ideal room size', ylab='replicate size')

When we are approximately 400 persons away the ideal room size, just do 1000 replicates. Whereas almost being to the goal, try 20000 replicates. Then the function can be modified as followed.

gbp <- function(k=2, p=.5, B_l=1000, B_u=20000, threshold=.001) {
  i <- k                       # Tested room size; start from k persons in a room
  x <- 1                       # Counter for while loop
  L <- 0                       # Lower bound for binary search
  R <- k^4+100+sample(-k:k, 1) # Upper bound for binary search
  
  # Create an empty data frame for result
  result <- setNames(data.frame(matrix(0, 100, 3)), c("room", "p", "replicate"))
  
  while (abs(result$room[x] - p) >= threshold) {
    # Increase replicate when approaching target room size
    r <- floor(-B_u/(1+exp(-.03* ((R-L)-200))))+(B_u+B_l)
    result$replicate[x] <- r
    
    # Test room size i
    i <- ceiling((L+R)/2)
    result$room[x] <- i 

    # Calculate probability
    birth <- matrix(sample(1:365, r*i, replace=TRUE), nrow=r, ncol=i)
    result$p[x] <- sum(apply(birth, 1, function(x) max(tabulate(x))>=k )) / r
    if (result$p[x] >= p) R <- i
    if (result$p[x] < p) L <- i
    
    # Break the loop if trapped
    if(sum(duplicated(tail(result$room[1:x], 2))) == 1 & result$room[x] != 0) break
    
    # Print out room sizes tested
    cat(paste0('Room size tested:', i, '\tprobability:', round(result$p[x], 3), '\treplicate size: ', r, '\n'))
    
    # Counter for steps
    x <- x + 1          
 }
  ans <- R
  return(list(answer=ans,
              result=result[1:x,]))
}

system.time(y<-gbp(k=8, p=.5))
y

I will continue the rest of this project in another post.

Chang-Yu Chang
Chang-Yu Chang
PhD candidate in Ecology and Evolutionary Biology