ARTÍCULO
TITULO

Record-and-Replay Techniques for HPC Systems: A Survey

Dylan Chapp    
Kento Sato    
Dong H Ahn    
Michela Taufer    

Resumen

Record-and-replay techniques provide the ability to record executions of nondeterministic applications and re-execute them identically. These techniques find use in the contexts of debugging, reproducibility, and fault-tolerance, especially in the presence of nondeterministic factors such as message races. Record-and-replay techniques are highly diverse in terms of the fidelity of replay they provide, the assumptions they make about the recorded application, the programming models they target, and the runtime overheads they impose.                                                                           In the high performance computing (HPC) environment, all the above factors must be considered in concert, thus presenting additional implementation challenges. In this manuscript, we survey record-and-replay techniques in terms of the programming models they target and the workloads on which they were evaluated, providing a categorization of these techniques benefiting application developers and researchers targeting exascale challenges. This manuscript answers three questions through this survey: What are the gaps in the existing space of record-and-replay techniques? What is the roadmap to widespread use of record-and-replay on production-scale HPC workloads? And, what are the critical open problems that must be addressed to make record-and-replay viable at exascale?                                             Keywords: Reproducibility, nondeterminism, fault-tolerance, exascale, message-passing, shared memory, proxy application, HPC benchmarks

 Artículos similares