Reducing Checkpoint Overhead in Grid Environment

Authors

  • A. S. Faki and R. G. Jimoh

Abstract

Grid computing has become a major player in the supercomputing community. However, because its resources are diverse and prone to disruption, job failure is not an exception. Many researchers have proposed models that improve job survivability. Popular among these is the checkpoint model, which saves the already computed portion of a job to stable, secure storage. This avoids re-computing a job from scratch when a resource fails. However, checkpointing itself takes time and adds overhead to the computing resources, thereby reducing their performance. To limit this overhead, the number of checkpoints must be minimized. This study proposes checkpoint interval models that are implemented based on the fault index history of computing resources. Failed jobs are re-allocated from their last saved checkpoint using an exception handler. The study observed that the arithmetic checkpoint model is better suited when the fault index of computing resources is high, while the geometric checkpoint model is better when the fault index is low.
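The abstract does not define the two interval models, so the following is only an illustrative sketch of one plausible reading: the gap between successive checkpoints grows by a fixed step in the arithmetic model and by a fixed ratio in the geometric model. The function names, the parameters (`first`, `step`, `ratio`, `total_time`) and the `threshold` used to pick a model are all assumptions made for this example, not values taken from the study.

```python
def arithmetic_checkpoints(first, step, total_time):
    """Checkpoint times whose gaps grow by a fixed step (assumed form)."""
    times, gap, t = [], first, 0.0
    while t + gap < total_time:
        t += gap
        times.append(t)
        gap += step          # gaps: first, first+step, first+2*step, ...
    return times


def geometric_checkpoints(first, ratio, total_time):
    """Checkpoint times whose gaps grow by a fixed ratio (assumed form)."""
    times, gap, t = [], first, 0.0
    while t + gap < total_time:
        t += gap
        times.append(t)
        gap *= ratio         # gaps: first, first*ratio, first*ratio**2, ...
    return times


def choose_model(fault_index, threshold=0.5):
    """Pick a model from a resource's fault index.

    The threshold is hypothetical; the abstract only states that the
    arithmetic model suits a high fault index and the geometric model
    a low one.
    """
    return "arithmetic" if fault_index >= threshold else "geometric"


if __name__ == "__main__":
    # A 100-unit job with a first checkpoint gap of 10 units.
    print(arithmetic_checkpoints(10, 5, 100))
    print(geometric_checkpoints(10, 2, 100))
    print(choose_model(0.8), choose_model(0.1))
```

Under these assumptions the geometric schedule spaces checkpoints out faster and therefore takes fewer of them over the same job length, which matches the abstract's observation that it fits resources that rarely fail, while the denser arithmetic schedule loses less work on resources that fail often.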

Published

1970-01-01

Section

Articles