Run an R script on HPC

I was trying to run simulations for my migration project on the Yale HPC (high-performance cluster). After spending a day figuring out how to do it, I think it's probably a good time to take some notes and dust off my website.

I am not going to go through how to access the HPC, because the cluster maintainers already provide user guides, like this one from the Yale Center for Research Computing. I will spend more (basically all) of the time on running an R script on the HPC.

Run R on HPC

Load R from a module

On the cluster I use (Grace), there is no pre-installed R software. Instead, like some bioinformatics tools I used in my genomics class, I have to load the module that includes the R software. My understanding is that the software is not available in interactive mode (on the compute node I am currently using interactively) unless the corresponding module is loaded. After logging in to the cluster, search for the module with the following bash command.

module avail R

Scrolling down the list a bit, it seems the module for R is Apps/R/3.4.3-generic. Let’s load the module in bash.

module load Apps/R/3.4.3-generic

Now simply type a capital R at the command line and you can play with R on the cluster.

Install packages

One big advantage of R is its powerful packages. For my simulation, I use three packages, tidyverse, deSolve, and data.table, because I need to read and write big datasets, do data formatting, and solve ODEs. Let me deviate a bit here. data.table and tidyverse are both pretty powerful packages for data science. They share some similar functions and mostly work well with each other. I usually use whichever function comes to mind first. For example, I use the pipe %>% from tidyverse and the fast reading and writing functions fread and fwrite from data.table.
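
As a minimal sketch of how I mix the two, here is the kind of pattern I mean; input.csv and the column rate are made up for illustration:

library(data.table)
library(tidyverse)

# fread (data.table) for fast reading, then tidyverse verbs via the pipe
dat <- fread("input.csv") %>%
  filter(rate > 0) %>%            # "rate" is a hypothetical column
  mutate(log_rate = log(rate))

# fwrite (data.table) for fast writing
fwrite(dat, "output.csv")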

Back to the HPC. Installing R packages on the cluster is fairly easy. Launch R and install the packages with install.packages().

install.packages(c("tidyverse", "deSolve", "data.table"))

Select a mirror and wait for the download. This step is pretty much the same as what we usually do on a personal computer. The downloaded packages are stored in a user library directory, something like ~/R/x86_64-pc-linux-gnu-library/3.4/
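
If you want to confirm where the packages end up (the exact path depends on the R version and the cluster setup), you can ask R for its library search paths:

# the first entry is usually the user library under ~/R/...
.libPaths()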

Launch R and load the packages with library(). Now you can enjoy the packages on the HPC.

library(tidyverse)

Run an R script with arguments from the command line

Now I have an R script that runs my simulation given the necessary parameters. The parameters, for example the flow rates or migration rates, are something I would like to modify and test within a certain parameter space. It would make life easier if these parameters could be changed outside the R script, so that I don't need to edit the script every time. How do I add this feature to the script?

Let’s start with a simple example. Say I want an R script to print out whatever I input as arguments from the command line. First, I create an R script that prints out the argument. This script, named test.R, has only two lines.

args <- commandArgs(trailingOnly = TRUE)
print(args[1])

The first line takes the arguments from the command line and saves them into a vector called args. The second line just prints out the first input argument. For example, type the command below in the command line.

Rscript test.R hello

It will print out “hello”. Note that all the arguments are saved in a vector (args here) and each element is a character string. That means if you are going to pass arguments that are supposed to be used in R as numbers or booleans, they have to be converted explicitly. I use the base functions as.numeric() and as.logical().
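
For instance, a slightly extended test.R could convert its inputs before using them; the argument order and names below are just an illustration:

args <- commandArgs(trailingOnly = TRUE)

migration_rate <- as.numeric(args[1])   # "0.05" becomes the number 0.05
verbose        <- as.logical(args[2])   # "TRUE" becomes the logical TRUE

if (verbose) print(paste("migration rate:", migration_rate))

This version could be run as Rscript test.R 0.05 TRUE.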

To clarify, although I want to run the simulation on the cluster, this method is not limited to the cluster. It can also be used on your own computer. But what’s the point of this if I have a great GUI like RStudio? One reason is that the process is even faster in the terminal than in RStudio. I was totally amazed by how quickly the simulation finished. With no background in computer science, my naive guess is that using a GUI is slower because the GUI adds an extra layer between the low-level computing process and the high-level graphical interface, which costs more compute time.

Run R in batch mode

This section now addresses the real power of HPC (probably some of it, at least better than burning out my laptop). But wait, what is batch mode?

Node, core, and other jargon

As usual, here is the terminology section. Some of the definitions are from here.

  • node: a computer. One node can take one task you assign to it.

  • core: the processing unit in a computer. Some commands use multiple cores to speed up the process. R can do it too, using packages like doMC or doSNOW (see the sketch after this list).

  • interactive mode: when logging in to the cluster, using interactive mode means you are controlling one node in the cluster via your terminal.

  • batch mode: the opposite of interactive mode. Batch mode means the computer(s) process the task without human intervention. This is done by submitting a batch job to the cluster, requesting several nodes (more computers!) and several cores (more processing units!).

  • batch job: the batch job contains several tasks, and each task is processed by one node. Usually it’s a text file that tells cluster how to do the job.
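
As a quick aside on the multicore point above, here is a minimal sketch of parallel R using foreach with the doMC backend; the number of cores and the toy computation are just placeholders:

library(foreach)
library(doMC)

registerDoMC(cores = 4)                  # register 4 cores as the parallel backend

results <- foreach(i = 1:100, .combine = c) %dopar% {
  sqrt(i)                                # replace with a real per-chunk computation
}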

Submit the batch job

Yale HPCs use Slurm to manage batch jobs. I use the SimpleQueue module to help me generate the batch job file and submit the batch job. How does it work? First, create a text file that contains the tasks you want to run in the bash environment, which means the syntax is bash. In this file, each line is one task, which will later be processed by one node. For instance, I want to run three independent tasks, and each of them will execute the test.R file with a single input argument. Within a task, commands are separated by semicolons, just like running two commands on one line in bash. In this text file, there are three lines.

module load Apps/R/3.4.3-generic; Rscript test.R hello
module load Apps/R/3.4.3-generic; Rscript test.R hey
module load Apps/R/3.4.3-generic; Rscript test.R hi

Note that you have to load the R module (or any module you need) in every task. Just like in interactive mode, where the R module has to be loaded each time you log in to the cluster, each task in this text file will be executed by an independent node; that’s why the module loading appears in every task.

You might also notice that this example will not return any visible result, because these commands only print out the argument I gave, and since everything runs in batch mode we cannot see the printed output. The result will only be saved if the R script explicitly writes it out.
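
For example, a batch-friendly version of test.R might write its result to a file instead of printing it; the output file name here is just a placeholder:

args <- commandArgs(trailingOnly = TRUE)

# save the result to disk so it survives the batch run
result <- data.frame(input = args[1], finished_at = as.character(Sys.time()))
data.table::fwrite(result, paste0("result_", args[1], ".csv"))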

So now we have a text file with the tasks we want to submit. Let’s name it myTask.txt. Go back to the command line and type the following commands.

module load Tools/SimpleQueue/3.0
sqCreateScript -n3 -c1 -m5G testBatch myTask.txt | sbatch

Load the SimpleQueue module and create a batch job script with sqCreateScript. Since I have three tasks and I want to run them in parallel on three nodes, I request three nodes with -n3 and one core per node with -c1 (-m5G sets the memory). The script is then piped (see the symbol |) to sbatch for submission. The submission is done!

The progress can be checked with squeue. Specify the user name, which is the NetID for Yale clusters.

squeue -u <NetID>

Pipeline

If I want to run simulations with multiple parameters on the HPC, here is the pipeline I use (a minimal end-to-end sketch follows the list).

  1. Write an R script that does something.

  2. Figure out which arguments need to change flexibly. Use commandArgs() to read that set of arguments from the command line. This step may not be necessary.

  3. Test the R script on my laptop for model robustness. Try to make the script workable. Of course, once the parameters in the model are scaled up, bugs will definitely pop up. That is the point (or joy) of using a cluster. Debug and debug.

  4. Create a text file in which each line contains one task (possibly with several bash commands) that is supposed to be done on one node.

  5. Submit the batch job using SimpleQueue module.
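
To make the pipeline concrete, here is a minimal sketch of what a parameterized simulation script could look like; the two-patch model, parameter names, and file names are all made up for illustration:

# simulate.R -- a toy parameterized ODE simulation (illustrative only)
library(deSolve)
library(data.table)

args <- commandArgs(trailingOnly = TRUE)
migration_rate <- as.numeric(args[1])    # e.g. 0.1
out_file       <- args[2]                # e.g. "run_0.1.csv"

# a toy two-patch migration model
model <- function(t, state, parms) {
  with(as.list(c(state, parms)), {
    dN1 <- -m * N1 + m * N2
    dN2 <-  m * N1 - m * N2
    list(c(dN1, dN2))
  })
}

out <- ode(y = c(N1 = 100, N2 = 10),
           times = seq(0, 50, by = 1),
           func = model,
           parms = c(m = migration_rate))

fwrite(as.data.table(out), out_file)

A matching task line in myTask.txt would then be something like: module load Apps/R/3.4.3-generic; Rscript simulate.R 0.1 run_0.1.csv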
