Discusses faults, errors, and failures; hardware fault tolerance; reliability; availability; reliable distributed systems; checkpointing and recovery; atomic actions; data and process resiliency; and software fault tolerance. Uses case studies. Prerequisite: Permission of the instructor.