Fault tolerance (CS 300 (PDC))

Introduction to fault tolerance

Four goals for dependable systems:
- Availability: readiness to be used at any given point in time.
- Reliability: ability to run continuously over a time interval without failure.
- Safety: if the system temporarily fails to operate correctly, nothing catastrophic happens.
- Maintainability: ease in repairing a failed system.
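Availability and reliability are easy to confuse, so here is a small numeric sketch. It is not part of the original notes: it assumes the standard steady-state formula availability = MTTF / (MTTF + MTTR) and a constant failure rate (exponential model) for reliability. The function names and the two example systems are made up for illustration.

```python
import math

def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Long-run fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)

def reliability(t_hours: float, mttf_hours: float) -> float:
    """Probability of running t hours without failure.

    Assumes a constant failure rate of 1/MTTF (exponential model).
    """
    return math.exp(-t_hours / mttf_hours)

# Hypothetical system A: fails about once an hour, recovers in one second.
a_avail = steady_state_availability(1.0, 1.0 / 3600.0)
a_rel_day = reliability(24.0, 1.0)

# Hypothetical system B: fails about once a year (8760 h), takes two weeks (336 h) to repair.
b_avail = steady_state_availability(8760.0, 336.0)
b_rel_day = reliability(24.0, 8760.0)

print(f"A: availability {a_avail:.5f}, P(no failure in 24 h) {a_rel_day:.2e}")
print(f"B: availability {b_avail:.5f}, P(no failure in 24 h) {b_rel_day:.5f}")
```

Under these assumptions A is highly available but unreliable, while B is reliable but less available. The repair time (MTTR) is where maintainability enters the picture.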
Failure, fault, error:
- A system fails if it does not satisfy its spec.
- A fault is a cause of error.
- An error is a part of a system's state that may lead to failure.
Fault tolerance is the ability of a system to continue to function according to spec (avoid failing), even in the presence of faults.

Duration of faults:
- Transient fault: fault occurs once, then doesn't recur.
- Intermittent fault: fault occurs, then vanishes on its own, then reappears, etc. (Most frustrating.)
- Permanent fault: fault continues until component repaired.
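The duration of a fault determines whether simply trying again can help. The sketch below is an illustration, not something from the course materials: `TransientFault`, `retry`, and `make_flaky` are hypothetical names. A bounded retry masks a fault that does not recur, but a permanent fault still surfaces once the attempts run out.

```python
import time

class TransientFault(Exception):
    """Hypothetical exception standing in for a fault that may not recur."""

def retry(operation, attempts=3, delay_s=0.0):
    """Call operation() up to `attempts` times; re-raise the last fault if all fail."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientFault:
            if attempt == attempts:
                raise  # out of attempts: the fault is not masked
            time.sleep(delay_s)

def make_flaky(failures_before_success):
    """Build an operation that fails the given number of times, then succeeds."""
    state = {"calls": 0}
    def operation():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise TransientFault(f"fault on call {state['calls']}")
        return "ok"
    return operation

def permanently_broken():
    """Simulates a permanent fault: fails on every call until repaired."""
    raise TransientFault("component still broken")

# A fault that occurs once is masked by the retry.
transient_result = retry(make_flaky(1))

# A permanent fault is not: every attempt fails and the exception escapes.
try:
    retry(permanently_broken)
    permanent_masked = True
except TransientFault:
    permanent_masked = False
```

An intermittent fault falls in between: a retry may or may not succeed, depending on whether the fault happens to be present during each attempt.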
Failure models:
- Crash failure: system halts, but works correctly until it halts.
- Omission failure: some step is omitted in a system's algorithm. Examples: server fails to perform a request; a Recv fails; a Send isn't sent.
- Timing failure: an action doesn't occur within a specified time. (In a real-time system, timing is part of the spec.)
- Response failure: an action is performed incorrectly. Example: a server sends wrong value or performs incorrect control flow.
- Byzantine failure or arbitrary failure: a system may produce arbitrary responses at arbitrary times.
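To make the failure models concrete, here is a rough sketch of how one request might look from the client's side. It is an illustration under strong assumptions: the client knows the correct answer (`expected`) and a deadline (`deadline_s`), which real clients usually do not. The names `Outcome` and `classify` are hypothetical.

```python
from enum import Enum
from typing import Optional

class Outcome(Enum):
    OK = "ok"
    OMISSION_OR_CRASH = "omission or crash failure"
    TIMING = "timing failure"
    RESPONSE = "response failure"

def classify(reply: Optional[int], elapsed_s: float,
             expected: int, deadline_s: float) -> Outcome:
    """Classify a single request as seen by the client.

    reply is None when nothing came back before the client gave up.
    """
    if reply is None:
        # From one missing reply, the client cannot tell a crashed server
        # from a server that merely omitted this step.
        return Outcome.OMISSION_OR_CRASH
    if elapsed_s > deadline_s:
        return Outcome.TIMING  # a reply arrived, but too late (checked before the value)
    if reply != expected:
        return Outcome.RESPONSE  # on time, but the wrong value
    return Outcome.OK

print(classify(None, 5.0, expected=42, deadline_s=1.0).value)
print(classify(42, 2.5, expected=42, deadline_s=1.0).value)
print(classify(41, 0.2, expected=42, deadline_s=1.0).value)
print(classify(42, 0.2, expected=42, deadline_s=1.0).value)
```

Byzantine failures have no branch here on purpose: a Byzantine server may answer correctly on this request and arbitrarily on the next, so no check of a single reply can rule them out.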
Eight fallacies of distributed computing (Deutsch) (see also Fallacies of Distributed Computing Explained)
Strategies for achieving fault tolerance: Redundancy (incl. groups); protocol; (automatic) checkpointing
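As one example of redundancy with a group, the sketch below runs the same request on several replicas and accepts a value only if a strict majority of the whole group reports it. This is a minimal illustration, assuming replicas are plain Python callables that return hashable values; `majority_vote`, `run_replicated`, and the three example replicas are hypothetical names.

```python
from collections import Counter

def majority_vote(replies, group_size):
    """Return the value reported by a strict majority of the group, else raise."""
    if replies:
        value, count = Counter(replies).most_common(1)[0]
        if 2 * count > group_size:
            return value
    raise RuntimeError("no majority among the replicas")

def run_replicated(replicas, x):
    """Send x to every replica and vote on the replies that come back."""
    replies = []
    for replica in replicas:
        try:
            replies.append(replica(x))
        except Exception:
            pass  # a crashed replica contributes no reply
    return majority_vote(replies, group_size=len(replicas))

def square(x):
    return x * x          # correct replica

def square_faulty(x):
    return x * x + 1      # response failure: wrong value

def square_crashed(x):
    raise RuntimeError("replica is down")  # crash failure

# Three replicas tolerate one faulty member of either kind.
masked_wrong_value = run_replicated([square, square_faulty, square], 7)
masked_crash = run_replicated([square, square_crashed, square], 7)

# Two faulty members out of three leave no majority, so the group fails visibly.
try:
    run_replicated([square, square_faulty, square_crashed], 7)
    two_faults_masked = True
except RuntimeError:
    two_faults_masked = False
```

The vote is counted against the group size rather than the number of replies, so a lone surviving replica cannot win by default.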