Checkpointing

Checkpointing is a way to add fault tolerance to your code by saving snapshots of the state, so it can restart from that point in case of failure. If your code runs for a long time, you may want to implement checkpointing so you can use the backfill queue.

  • Most efficient way is to do this explicitly: given an algorithm, save the state/variables necessary to restart the algorithm at regular intervals and implement an appropriate restart function. Trying to integrate checkpoint a posteriori is difficult, so this is a good thing to think about when starting new projects.
  • Use a checkpointing library. Checkpointing can be implemented by saving the entire binary state of a program either by means of a daemon or by means of direct instrumentation via an API. Note: JAVA checkpoints are portable because of the JVM, but for C/C++ this is not the case. You will need to restart on the same type of CPU architecture on which you checkpointed.

Note: Checkpointing may be complicated if your program uses a ton of memory or if you are trying to checkpoint a parallel code.