Topic: “Submitting Checkpoint/Restart Jobs on SHARCNET with BLCR”
Speaker: Doug Roberts, SHARCNET
Webinar link: SN-Seminars Vidyo room
In this webinar we demonstrate how to checkpoint and restart serial or threaded jobs submitted to the queue on SHARCNET clusters without modifying the application source code. We focus on BLCR (Berkeley Lab Checkpoint/Restart), a software package installed as a module on SHARCNET. BLCR performs checkpoint/restart inside the Linux kernel. This makes it less portable than solutions built on user-level libraries, but it gives BLCR full access to kernel resources, so it can restore resources (such as process IDs) that user-level libraries cannot, as well as groups of processes (such as shell scripts and their sub-processes) along with their pipes. BLCR provides a straightforward and powerful way to add fault tolerance to your daily HPC workflow, so consider putting this talk into your November schedule.
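
For orientation before the talk, here is a minimal sketch of the three BLCR command-line tools involved in a checkpoint/restart cycle: cr_run, cr_checkpoint and cr_restart. The Python helper below only assembles the command lines and prints them; it runs nothing. The helper name blcr_commands, the application ./my_app, the process ID 12345 and the file name my_app.ckpt are hypothetical placeholders. How the module is loaded and how these commands are wrapped in a SHARCNET queue submission is not shown here and is left to the webinar.

```python
import shlex


def blcr_commands(app_argv, pid, context_file):
    """Build (but do not run) the BLCR command lines for one checkpoint cycle.

    app_argv     -- the unmodified application and its arguments (hypothetical)
    pid          -- process ID of the running application (hypothetical)
    context_file -- where the checkpoint image is written (hypothetical name)
    """
    # Start the application under BLCR control; no source changes are needed.
    run = ["cr_run"] + list(app_argv)
    # Checkpoint the process tree rooted at pid into context_file, then
    # terminate it (--term) so the job can be resumed later.
    checkpoint = ["cr_checkpoint", "--tree", "--file", context_file,
                  "--term", str(pid)]
    # Resume the saved processes from the checkpoint image.
    restart = ["cr_restart", context_file]
    return run, checkpoint, restart


if __name__ == "__main__":
    for cmd in blcr_commands(["./my_app", "input.dat"], 12345, "my_app.ckpt"):
        print(shlex.join(cmd))
```

The --tree option asks cr_checkpoint to save the given process together with its descendants, which corresponds to the "groups of processes" capability described above.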
Need help attending a webinar? See the SHARCNET Help Wiki.