Fault tolerance
CS 300 (PDC)
Four goals for dependable systems:
Availability: readiness to be used at any given point in time.
Reliability: ability to run continuously over a time interval without failure.
Safety: if the system temporarily fails to operate correctly, nothing catastrophic happens.
Maintainability: ease in repairing a failed system.
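Failure, fault, error: a system fails when it cannot meet its spec; an error is a part of the system's state that may lead to a failure; a fault is the cause of an error.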
Fault tolerance is the ability of a system to continue to function according to spec (avoid failing), even in the presence of faults.
Duration of faults.
Transient fault: fault occurs once, then doesn't recur.
Intermittent fault: fault occurs, then vanishes on its own, then reappears, etc. (Most frustrating.)
Permanent fault: fault continues until component repaired.
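The difference between a transient and a permanent fault shows up in how a simple retry behaves. The following is a minimal sketch, not a prescribed technique from these notes; `TransientFault`, `call_with_retry`, and `make_flaky` are hypothetical names introduced only for illustration.

```python
import time


class TransientFault(Exception):
    """Hypothetical exception standing in for a fault that may clear on its own."""


def call_with_retry(operation, attempts=3, delay=0.0):
    """Call operation() up to `attempts` times, retrying on TransientFault."""
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except TransientFault as err:
            last_error = err
            time.sleep(delay)  # give the fault a chance to clear
    # Every attempt failed: from the caller's view the fault is permanent.
    raise last_error


def make_flaky(failures_before_success):
    """Return an operation that fails a fixed number of times, then succeeds."""
    state = {"remaining": failures_before_success}

    def operation():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise TransientFault("temporary glitch")
        return "ok"

    return operation
```

With `make_flaky(2)` the third attempt succeeds, so the retry masks the fault. With `make_flaky(10)` and three attempts the exception propagates, which is how a fault that needs repair would look to the caller.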
Failure models
Crash failure: system halts, but works correctly until it halts.
Omission failure: some step is omitted in a system's algorithm. Examples: server fails to perform a request; a Recv fails; a Send isn't sent.
Timing failure: an action doesn't occur within a specified time. (Real-time system: timing is part of the spec for a system.)
Response failure: an action is performed incorrectly. Example: a server sends the wrong value or follows incorrect control flow.
Byzantine failure or arbitrary failure: a system may produce arbitrary responses at arbitrary times.
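As a rough illustration of how some of these models differ from a client's point of view, the sketch below labels a single observed reply. The `Observation` record and `classify` function are hypothetical, and the sketch assumes the client knows the expected value and a deadline. A single observation cannot distinguish a crash from an omission, and Byzantine behavior cannot be recognized this way at all.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    """What a client saw for one request (hypothetical record type)."""
    value: Optional[int]      # None means no reply ever arrived
    elapsed: Optional[float]  # seconds until the reply; None if no reply


def classify(obs, expected, deadline):
    """Label one observation as 'omission', 'timing', 'response', or 'ok'."""
    if obs.value is None:
        return "omission"   # the reply step never happened
    if obs.elapsed is not None and obs.elapsed > deadline:
        return "timing"     # a reply arrived, but too late
    if obs.value != expected:
        return "response"   # on time, but the wrong value
    return "ok"
```

For example, `classify(Observation(41, 0.2), expected=42, deadline=1.0)` yields `"response"`, while the same wrong value arriving after the deadline is labeled `"timing"` because the deadline check runs first.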
Eight fallacies of distributed computing (Deutsch) (see also Fallacies of Distributed Computing Explained)
Strategies for achieving fault tolerance: redundancy (incl. groups); protocols; (automatic) checkpointing
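One common form of redundancy is to send the same request to several replicas and accept the answer that a strict majority agrees on. The sketch below assumes replies are hashable values collected in a list; `majority_vote` is a hypothetical helper, not part of any library.

```python
from collections import Counter


def majority_vote(replies):
    """Return the value reported by a strict majority of replicas, else None."""
    if not replies:
        return None
    value, count = Counter(replies).most_common(1)[0]
    # Require more than half of all replies, so a tie never wins.
    return value if count > len(replies) / 2 else None
```

With three replicas, `majority_vote([7, 7, 9])` returns `7`, so one replica returning a wrong value is masked. If no value has a strict majority, as in `majority_vote([1, 2, 3])`, the function returns `None` and the caller has to handle the failure some other way.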