Final Report Outline
(Draft in Progress)
Four Pieces:
- Executive Summary
- Research Solicitation
- Program Vision (main description)
- Appendices
Executive Summary
Target: 2p (starting point might be vision document from June)
(It would be good to have a pithy congressional slogan. Extending the technology revolution?)
- scaling good, computers everywhere...critical to our lives and economy at many levels
- challenges ahead due to reliability and power
- hints there is something we can do
- need research to develop new solutions
- here's what we think we can do for you
- need leadership (government leadership)
Research Solicitation
Target: 2p
- outline key challenge ahead
- kind of research solicited
- program structure (? program coverage of promising directions?)
- evaluation criteria
Program Vision
- What are you trying to do?
- Allow continued scaling benefits
- Reduce energy/operation
- Reduce $$/gate
- Increase ops/time with limited power-density budget
- While maintaining or improving safety
- Navigate inflection points in energy and reliability
- Why now?
- inflection points in reliability, energy
- critical deployment of computation
- system size?
- How is it done today?
- Demand reliable, consistent device operation
- Margin for worst-case device effect among billions of devices, over multi-year lifetime
- Discard components when devices fail
- System-level redundancy
- The niches where the above is not good enough are small but important (avionics, medical)
- Spend considerable $$, energy for reliability
- E.g., brute-force replication
- Many-year performance lag behind commercial systems
- Trends?
- power limited trends? ... gap from margining?
- voltage scaling (ITRS)
- decreased dopants --> variability (ITRS)
- roadmap work on rate of variation-induced defects
- increasing transistors/chip
- increasing system sizes (supercomp, data centers)
- decreased opportunity for burn-in
- increased wear-out effects
- decreased critical charge --> increased upset susceptibility
- roadmap work on {intrinsic,extrinsic} upsets?
- GDP in electronics?
- electronics in critical systems
- What can we accomplish?
- Build reliable systems from unreliable components
- Efficiently compensate for unpredictable devices through cooperation at higher levels of system stack
- Ground goals
- scale how much further?
- allow how many more ops/Joule?
- how close to raw scaling?
- extend component life by how much?
- ...more... (depend on / synch with challenges)
- What's new? (Ideas and promising directions)
- Ubiquitously/pervasively exploit:
- Cross-layer codesign --- Multi-level tradeoffs (generalization of hardware/software)
- (incl. tools to support, system abstractions, algorithm design)
- Design prepared for self-assessment of safety margins and repair
- Cooperative filtering of errors at multiple levels
- Strategic, low-overhead redundancy
- Differential reliability
- Scalable and adaptive solutions
- Why do this?
- Reliability matters for everything moving forward...just a matter of how much (in the past, for a large class of applications and system sizes, the base technology was reliable enough that there was no need to address reliability in the design; as we move forward, between increasing technology noise and increasing base system sizes, it must be addressed in design for all applications.)
- Allow scaling to continue without sacrificing safety
- Continued reduction in energy/op
- Continued reduction in $$/op
- Maintain or extend component lifetimes
- How much further?
- Allow construction of larger, dependable systems
- Make infrastructural technology worthy of the trust we place in it
- Specific big wins (challenges overcome) from focus groups?
- feasible to fly commercial? and/or have any access to most advanced technology? (close commercial/aerospace component gap?); allow aerospace to exploit modern electronics?
- advanced technology safe for drive-by-wire?
- enable larger (more components, computation) medical devices?
- enable supercomputers able to solve XXX problems?
- ??? review focus group output and select appropriate to highlight here ???
- Can't have security without getting resilience under control
- Why government leadership?
- Challenge problems and areas of pain
- Common/cross-cutting challenges
- By focus group (5)
- Commercial
- Address growing reliability challenge with small enough overhead to avoid negating benefit of scaling
- Reduce energy per operation while retaining reliable operation
- Maintain or extend lifetimes in face of increasing wear effects
- Economically address demand for components with different reliability needs
- Navigating complex, multidimensional design space
- Aerospace
- System lifetimes >> changes in political and scientific need
- Navigating complex, multidimensional design space
- Widening gap between commercial and mil/aero components
- Design for (uncommon) worst-case environment
- Bottleneck in testing
- Focus on part reliability over system reliability
- Large Scale
- Life Critical
- Infrastructure
- Big Science Questions?
- How do we organize, manage, and analyze layering for cooperative fault mitigation?
- How do we best accommodate repair?
- What is the right level of filtering at each level of the hierarchy?
- Can we establish a useful theory and collection of design patterns for lightweight checking?
- What would a theory and framework for expressing and reasoning about differential reliability look like?
- Can we develop a scalable theory and architectures that allow adaptation to various upset rates and system reliability targets?
- Mission Impacts? (perhaps related to what we are trying to summarize above under specific challenges overcome, but the detailed support goes here; these are also the challenges/targets/opportunities for more mission-oriented agencies)
- security
- cyberphysical
- satellite
- supercomputer
- ???
- Critical Questions? (big risk items? ... more strategic questions?)
- enable concurrent research on understanding low-level upset and fatigue effects in ever-changing technology, alongside high-level mitigation
- manage developer burden (avoid increasing it)
- must be careful pushing more complexity into software when we haven't mastered software reliability
- ??? others
- Metrics, Goals, Measure and manage programs
- Goals of metrics
- assess whether proposed research is attacking the right problems?
- measure whether research is making progress on solving the problem?
- Some possible, primary metrics
- Energy/Op at noise rate and performance target (Noise rate: defects, variation, wear, transients)
- Post-fab adaptability to range of noise rates
- Timeliness and quality of adaptation
- Recommendations from metrics group
- Research Organization and Infrastructure
- Examples and Illustrative Scenarios
- Processor
- SoC
- High-level software
- Layer discussion and community contributions: probably good to have a clear description of the layers and call out the wide range of communities that could participate in this research
- Process and Participants
- summarize activities of group (meetings, wiki, focus groups...)
- comprehensive list of participants
Appendix
- focus group reports