Scenario 1

Summary

Description

As an example of how multi-level reliability might be implemented in a computing device, consider the CPU of a 2018-era computer that contains a memory hierarchy using the approaches described earlier and a number of I/O devices. Assuming current CMOS trends continue, such a CPU might have a die size of 1–2 cm² and contain between sixteen and sixty-four cores of similar complexity to current CPUs, a larger number of finer-grained execution units (e.g. GPUs, FPGA-like cells), or a mix of execution-unit granularities. This CPU might be designed with a peak clock rate of 6–8 GHz, although power and thermal limitations would prevent it from operating all of its cores at their peak clock rates except in short bursts. Instead, the operating system will dynamically adjust the supply voltages and clock frequencies of the cores, tuning their throughput to meet the demands of the operating environment and the characteristics of the applications being executed without overheating the CPU.
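The throttling policy described above can be sketched roughly as follows. This is a toy Python model, not a real OS governor: the cubic power model, the 120 W budget, and the 16-core configuration are invented for illustration.

```python
# Toy sketch of an OS DVFS policy: raise each core's clock toward its peak
# only while the aggregate power estimate stays under a thermal budget.
# The power model and all constants here are illustrative assumptions.

PEAK_GHZ = 8.0
MIN_GHZ = 1.0
POWER_BUDGET_W = 120.0

def core_power(freq_ghz):
    """Toy model: dynamic power grows roughly with f^3 (voltage tracks f)."""
    return 0.05 * freq_ghz ** 3

def assign_frequencies(demands, budget=POWER_BUDGET_W):
    """Clamp each core's requested frequency, then scale all cores down
    uniformly until the total power estimate fits the budget."""
    freqs = [min(max(d, MIN_GHZ), PEAK_GHZ) for d in demands]
    while sum(core_power(f) for f in freqs) > budget:
        freqs = [max(f * 0.95, MIN_GHZ) for f in freqs]
    return freqs

# A short burst at peak on two cores fits the budget; all sixteen cores
# demanding peak clock rate get throttled to a sustainable frequency.
burst = assign_frequencies([8.0, 8.0] + [1.0] * 14)
sustained = assign_frequencies([8.0] * 16)
```

The multiplicative back-off mirrors the idea that peak clock rates are reachable only in bursts: the policy never refuses work, it just lowers frequency until the chip stays within its thermal envelope.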

Such a chip would prepare for repair by incorporating hardware structures that allow cores, ALUs, memory blocks, or other functional units to be disabled and isolated from the rest of the chip if they develop permanent faults1. Multi-level rollback support would be provided through instruction squashing at the microarchitectural level1 and hardware support for low-overhead checkpointing1 for larger rollback windows at the OS and application levels.
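The checkpoint-and-rollback idea at the OS/application level can be illustrated with a minimal sketch. The class and its interface are hypothetical, chosen only to show the mechanics of keeping a bounded window of snapshots and rewinding to a known-good one.

```python
# Illustrative sketch (not a real OS interface) of multi-level rollback:
# keep a bounded window of state snapshots so that, after an error,
# execution rewinds to the most recent consistent checkpoint instead of
# restarting from scratch.

import copy

class CheckpointManager:
    def __init__(self, window=3):
        self.window = window           # how many snapshots to retain
        self.snapshots = []

    def checkpoint(self, state):
        """Save a deep copy of the state; drop snapshots beyond the window."""
        self.snapshots.append(copy.deepcopy(state))
        self.snapshots = self.snapshots[-self.window:]

    def rollback(self, levels=1):
        """Discard the `levels - 1` newest snapshots, restore the next one."""
        for _ in range(levels - 1):
            self.snapshots.pop()
        return copy.deepcopy(self.snapshots[-1])

mgr = CheckpointManager()
state = {"pc": 0, "balance": 100}
mgr.checkpoint(state)
state["balance"] = 60        # ...computation proceeds, then an error hits
state = mgr.rollback()       # revert to the last known-good snapshot
```

The bounded window stands in for the "larger rollback windows" the hardware support would make affordable: deeper windows allow older errors to be undone, at the cost of more checkpoint storage.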

The CPU would also support strategic redundancy in critical modules. Register files and on-chip memories would employ ECC1. The CPU would use lightweight detection mechanisms to catch errors in computations2, such as residue arithmetic1, parity bits on instruction words1, operation sequence signatures1, and a heartbeat timer1 to detect software “hangs.” These mechanisms will detect and filter out the vast majority of transient errors and, in combination with test routines, will be used to diagnose permanent errors.
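Two of these lightweight checks are simple enough to sketch directly. The following toy Python code shows mod-3 residue arithmetic verifying an addition and an even-parity bit catching a single-bit upset in an instruction word; function names and the example values are illustrative.

```python
# Minimal sketches of two lightweight detection mechanisms:
# (1) residue arithmetic: the residues of the operands, combined mod 3,
#     must match the residue of the result computed by the main datapath;
# (2) an even-parity bit over a 32-bit instruction word, which flags any
#     single-bit flip.

def residue_checked_add(a, b):
    """Compute a + b and cross-check it with cheap mod-3 residues."""
    result = a + b
    ok = result % 3 == (a % 3 + b % 3) % 3
    return result, ok

def parity(word):
    """Even parity over a 32-bit word (0 = even number of 1 bits)."""
    return bin(word & 0xFFFFFFFF).count("1") % 2

value, ok = residue_checked_add(12345, 67890)

stored = parity(0xDEADBEEF)              # parity recorded at encode time
flipped = 0xDEADBEEF ^ (1 << 7)          # single-bit upset in storage
error_detected = parity(flipped) != stored
```

The appeal of both schemes is that the checker is far smaller than the unit it guards: a mod-3 residue datapath or a parity tree costs a few bits of logic, yet filters out the bulk of transient errors.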

Further, the CPU will cooperate across multiple levels with the OS and applications to exploit differential reliability, making reliability/performance/power-consumption trade-offs according to the needs of the application. By default, the OS might enable the CPU’s error detection and checkpointing hardware, assuming that the application does not contain any self-checks but is not critical enough to merit redundant execution of each operation. Applications that can tolerate some errors in their outputs, such as video playback1, could inform the OS of this fact, allowing it to disable some of the reliability mechanisms to reduce power consumption1. Applications that embed their own lightweight checks, such as ILP solvers that check the validity of the solution found1, could disable most or all of the error-checking hardware except during consistency checks, while functional applications that do not modify their inputs or external state might choose to disable checkpointing and re-execute from the beginning if an error occurs. Conversely, applications that contain critical regions could request that the OS turn on redundant execution during those regions. For example, an on-line banking application might run with “normal” reliability settings while the user is checking balances and reviewing past statements, but turn on redundant execution during a balance transfer transaction.
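The application/OS negotiation described here amounts to a small policy table. The sketch below is a hypothetical interface, with invented policy names and feature flags, meant only to show how declared application needs could map onto hardware reliability features.

```python
# Hypothetical sketch of a differential-reliability interface: an
# application (or a critical region within one) declares its needs and
# the OS maps the request to hardware features. The policy names and
# feature flags are assumptions for illustration, not a real API.

POLICIES = {
    "error_tolerant": {"ecc": False, "checkpoints": False, "redundant_exec": False},
    "default":        {"ecc": True,  "checkpoints": True,  "redundant_exec": False},
    "self_checking":  {"ecc": True,  "checkpoints": False, "redundant_exec": False},
    "critical":       {"ecc": True,  "checkpoints": True,  "redundant_exec": True},
}

class ReliabilityManager:
    def __init__(self):
        self.active = dict(POLICIES["default"])

    def request(self, policy):
        """Switch the hardware configuration to the requested policy."""
        self.active = dict(POLICIES[policy])
        return self.active

os_mgr = ReliabilityManager()
playback = os_mgr.request("error_tolerant")   # e.g. video playback
critical = os_mgr.request("critical")         # e.g. a balance transfer
```

The banking example in the text corresponds to switching from "default" to "critical" for the duration of the transfer, then back again, so that redundant execution is paid for only where it matters.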

When the processor, OS, or application detects an error in a computation, it will use multi-level schemes to diagnose and correct the error. The hardware records that an error has occurred, either by incrementing a hardware counter or signaling an exception. The faulty operation is reissued. If the operation completes correctly the second time, the faulty operation is logged as an unrepeatable error. If the operation fails twice in a row, or fails too many times in a particular window of time, the processor assumes that a permanent error has occurred and signals an exception to the operating system. Finally, the operating system responds by performing a local diagnosis and repair, migrating the computation to another core, reverting to its last checkpointed state, and/or terminating the application. At higher levels, the application can report suspicion of an error to the OS and request rollback and re-execution1. The OS notices when the application is making re-execution requests too frequently; this, too, can trigger diagnosis, repair, and migration. In addition to this reactive diagnosis and migration, the OS will periodically migrate computations off of each core and invoke test routines on the core to detect permanent errors1. During these tests, the OS may deliberately overstress the core by running it at accelerated clock rates in order to determine whether aging effects are degrading the core’s performance to the point where it can no longer run safely at a reasonable level of performance or energy consumption1. The results of these tests, as well as tests performed when the processor is fabricated, will be stored by the OS and used when making decisions about the operational clock frequency for each core and the assignment of tasks to cores.
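The reissue-and-escalate flow in this paragraph reduces to a small decision procedure. The sketch below is illustrative Python, with an invented per-window threshold; `execute` stands in for reissuing the faulty operation.

```python
# Sketch of the diagnosis flow described above: reissue a faulty
# operation once; if it now succeeds (and errors remain rare), classify
# the error as transient/unrepeatable, otherwise escalate to the OS as a
# suspected permanent fault. The threshold is an illustrative assumption.

WINDOW_LIMIT = 5   # errors tolerated per observation window

def handle_error(execute, error_counter):
    """Return ('transient' | 'permanent', updated error count)."""
    error_counter += 1                  # hardware logs the error
    ok = execute()                      # reissue the faulty operation
    if ok and error_counter <= WINDOW_LIMIT:
        return "transient", error_counter
    # Failed twice in a row, or too many errors in this window: the OS
    # now diagnoses, migrates, rolls back, and/or terminates.
    return "permanent", error_counter

flaky = iter([True])                    # succeeds on reissue
verdict, count = handle_error(lambda: next(flaky), error_counter=0)

stuck = iter([False])                   # fails again on reissue
verdict2, _ = handle_error(lambda: next(stuck), error_counter=0)
```

Counting errors per window, rather than reacting to each one, is what lets the scheme distinguish occasional particle strikes from a unit that is genuinely wearing out.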

The hardware and the OS will adapt to the error rates seen in the system. The CPU might include selectable ECC logic, allowing the system to configure different amounts of protection against soft errors depending on the needs of the application. Similarly, the CPU might add logic that allows groups of cores to check each other's results when running critical computations that demand particularly high reliability. The operating system will monitor the error rate logs, calculate the reliability of the system, and reconfigure the hardware to achieve a desired level of resilience. For example, a system located at sea level might well be able to operate with fewer reliability mechanisms enabled than one located in Denver, Colorado, due to the greater rate of radiation-induced soft errors at Denver’s altitude1. Systems that experience substantial variation in error rates, such as airplane1 or spacecraft control systems1, could extend this capability by incorporating radiation detectors that allow them to respond more quickly to changes in soft error rates, avoiding the need to operate under worst-case assumptions at all times.
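The OS side of this adaptation is essentially a feedback rule from measured error rate to protection level. The following sketch uses invented thresholds and protection-level names purely to illustrate the shape of that rule.

```python
# Illustrative sketch of error-rate-driven reconfiguration: the OS
# tracks observed soft errors and enables stronger (costlier) protection
# only when the measured rate exceeds a target. Thresholds and level
# names are invented for illustration.

def choose_protection(errors, hours, target_rate=0.01):
    """Pick a protection level from the observed errors-per-hour rate."""
    rate = errors / hours
    if rate <= target_rate:
        return "parity_only"        # quiet environment, e.g. sea level
    if rate <= 10 * target_rate:
        return "secded_ecc"         # single-error-correct, double-detect
    return "redundant_cores"        # high-altitude / high-radiation setting

sea_level = choose_protection(errors=2, hours=1000)
denver = choose_protection(errors=50, hours=1000)
```

A radiation detector, as in the avionics example, would simply feed this rule a predicted rate instead of a measured one, letting the system upgrade protection before errors accumulate.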

In-field adaptation can also efficiently accommodate aging. As a part ages, it may exhibit higher fault rates, demanding more frequent checkpoints and heavier use of resilience mechanisms; since each component ages differently, we do not penalize the robust components for the fraction that ages more quickly.

Overall, distributing reliability across the system stack provides both dependability and flexibility by allowing the system to tune itself based on its needs.

Comments

References

  1. Reference Needed (1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17)

  2. Meixner, A.; Bauer, M. E.; Sorin, D. J., "Argus: Low-Cost, Comprehensive Error Detection in Simple Cores," in Proc. 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007), pp. 210–222, Dec. 2007. (18)

Scenarios/S1 (last edited 2009-02-05 18:45:58 by NikilMehta)