%0 Journal Article %T Fault Tolerance In Grid Computing: State of the Art and Open Issues %A Ritu Garg %A Awadhesh Kumar Singh %J International Journal of Computer Science and Engineering Survey %D 2011 %I Academy & Industry Research Collaboration Center (AIRCC) %X Fault tolerance is an important property for large scale computational grid systems, where geographically distributed nodes co-operate to execute a task. In order to achieve high level of reliability and availability, the grid infrastructure should be a foolproof fault tolerant. Since the failure of resources affects job execution fatally, fault tolerance service is essential to satisfy QOS requirement in grid computing. Commonly utilized techniques for providing fault tolerance are job checkpointing and replication. Both techniques mitigate the amount of work lost due to changing system availability but can introduce significant runtime overhead. The latter largely depends on the length of checkpointing interval and the chosen number of replicas, respectively. In case of complex scientific workflows where tasks can execute in well defined order reliability is another biggest challenge because of the unreliable nature of the grid resources. %K Grid Computing %K Fault Tolerance %K Workflow Grid %U http://airccse.org/journal/ijcses/papers/0211cses07.pdf