= Cross-Layer Optimization to Address the Dual Challenges of Energy and Reliability =

== DATE Session 8.2 ==

(This session to be held at [[http://www.date-conference.com|DATE 2010]], March 8-12, Dresden, Germany.)

Increasing unpredictability threatens our ability to continue scaling integrated circuits at Moore's Law rates. As the transistors, wires, and other components that make up integrated circuits become smaller, they exhibit both greater variation (differences in behavior between devices designed to be identical) and greater vulnerability to transient and permanent faults. Conventional design techniques expend energy to tolerate this unpredictability, either by replicating circuitry or by adding safety margins to a circuit's operating voltage, clock frequency, or charge stored per bit of data. Such approaches have been effective in the past, but their costs in energy and system performance are rapidly becoming unacceptable, particularly in an environment where power consumption is often the limiting factor on integrated circuit performance and energy efficiency is a global concern.

To continue scaling and reducing energy, we can no longer assume that devices will be fabricated perfectly and identically, or that circuits will operate without transient upsets. Higher layers in the system stack (architecture, firmware, OS, compilers, and applications) must cooperate to mitigate these unpredictable effects efficiently. Reliability must be a first-order concern that design automation tools optimize as a design metric along with energy, delay, area, and thermal profile. These tools must explore a larger space of optimizing transformations, including tradeoffs across the layer stack and continuing optimization throughout a component's operational lifetime. Sample multi-layer techniques include:

 * combining repairable architecture organizations that can avoid bad or excessively high-energy devices with detection software that identifies in-system failures and an operating system that can orchestrate repair.
 * using application-provided invariants and self-checks (information margins) to detect errors and trigger firmware-assisted recovery routines; this allows circuits to run at minimum energy rather than spending energy margins to guarantee that upsets never occur (see the sketch after this list).
 * strategically allocating functionality across layers and passing information among them, avoiding the need to pay large energy overheads for uncommon events.
 * exploiting differential reliability by designing a small number of circuits that use larger feature sizes or higher energy to guarantee their reliability, then using those circuits to supervise and check the operation of a much larger number of low-energy, error-prone circuits.
 * architectures, applications, and operating systems cooperating to adapt the level of protection they provide to the requirements of the application (compare a 911 call to a game of solitaire), the reliability of the circuits and devices (which may change with aging), and the environment of operation (sea level vs. high altitude, and/or temperature), using the minimum energy that maintains the required operational reliability.
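The following minimal sketch illustrates the invariant-and-recovery idea from the list above. It is only a sketch: `compute_block`, the injected fault, and the doubling invariant are hypothetical stand-ins for an application-level computation and its self-check.

{{{#!python
import random

def compute_block(data):
    # Hypothetical computation on low-margin, error-prone hardware;
    # an occasional injected bit flip models a transient upset.
    result = [x * 2 for x in data]
    if random.random() < 0.01:
        i = random.randrange(len(result))
        result[i] ^= 1 << random.randrange(8)  # flip one bit
    return result

def invariant_holds(data, result):
    # Application-provided self-check (an "information margin"):
    # here, the known relation result[i] == 2 * data[i].
    return all(r == 2 * x for x, r in zip(data, result))

def run_with_recovery(data, max_retries=3):
    # Detect-and-recover loop: run at minimum energy, check the
    # invariant, and pay for re-execution only when an error is seen.
    for _ in range(max_retries):
        result = compute_block(data)
        if invariant_holds(data, result):
            return result
        # A real system would invoke a firmware-assisted recovery
        # routine here; this sketch simply retries.
    raise RuntimeError("unrecoverable error after retries")

print(run_with_recovery(list(range(16))))
}}}

The energy argument is that the cheap check runs on every result, while the expensive recovery path is exercised only when an upset actually occurs.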
In the realm of memory and communication, we have a long history of success tolerating unpredictable effects, including fabrication variability, transient upsets, and lifetime wear, by using strategic information and multi-layer approaches that anticipate, accommodate, and suppress errors.

In memory devices, error-correcting codes correct single-bit errors through an increase in the amount of information used to store each word: they eliminate all single-bit errors at the cost of only about a 12% increase in both energy and information for a 64-bit word. Using these techniques to tolerate occasional and unpredictable deviations from intended function has reduced cost, reduced energy, and increased performance while guaranteeing robust operation in the presence of noise.

Unfortunately, mitigating errors in logic is neither as simple nor as well researched as it is in memory or communication systems. This gap in understanding has led to very expensive solutions, such as triple-modular redundancy, in which the computation is performed three times so that erroneous calculations can be voted out, at a cost of 3x the energy of the base computation: a 200% energy margin. We believe there is ample need and opportunity to bring these kinds of efficient, cross-layer solutions to computation.

This special session describes the vision and the need for new design automation. It summarizes findings and vision from an [[http://www.relxlayer.org|ongoing study of cross-layer reliability]].

 1. [[attachment:roadmap.pdf|Reliability Roadmap]] (presenter: S. Nassif, IBM)
 2. [[attachment:vision.pdf|Vision]] (presenter: A. DeHon, University of Pennsylvania)
 3. [[attachment:techniques.pdf|Techniques and Examples]] (presenter: N. Carter, Intel)
 4. [[attachment:metrics.pdf|Metrics and Needs for Automating Cross-Layer Reliability Optimization]] (presenter: S. Mitra, Stanford)
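To make the margin arithmetic above concrete, here is a minimal sketch contrasting the modest information margin of a single-error-correcting memory code with the 200% energy margin of triple-modular redundancy; `unreliable_op` is a hypothetical stand-in for fault-prone logic.

{{{#!python
import random
from collections import Counter

def sec_check_bits(data_bits):
    # Check bits r needed for single-error correction (Hamming bound):
    # smallest r with 2**r >= data_bits + r + 1.
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r

# Memory-style margin: ~11% overhead for a 64-bit word (the common
# SECDED variant adds one more check bit, for 8/64 = 12.5%).
r = sec_check_bits(64)
print(f"SEC for 64 bits: {r} check bits ({r / 64:.0%} overhead)")

def unreliable_op(x):
    # Hypothetical fault-prone logic: occasionally flips one result bit.
    result = x * x
    if random.random() < 0.05:
        result ^= 1 << random.randrange(16)
    return result

def tmr(op, x):
    # Logic-style margin: triple-modular redundancy runs the operation
    # three times and majority-votes, paying ~3x energy (a 200% margin)
    # whether or not a fault actually occurs.
    results = [op(x) for _ in range(3)]
    value, votes = Counter(results).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no majority: more than one copy failed")
    return value

print(tmr(unreliable_op, 12))
}}}

The contrast is the point: the code's margin is a small, fixed fraction of the stored information, while TMR pays its full 3x cost on every operation, fault or no fault.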