Summary of Third Meeting (Oct 29--30, 2009, Austin, TX)
[AMD note: where possible, I've tried to pull out themes and order them by importance rather than writing a chronological book report]
Goals
paragraph of framing for this meeting
Tuning up Story
- (some directions to improve the intro (10-20 slide) story---important because this refines the story we'll be telling funders...and may serve as a similar template for some of the documents)
- define the goal more crisply, including: draw the layers -- show an error in logic and how it is handled by higher layers
- include education
- need to reflect fault rate up to system level for examples.
- (e.g., This means your computer reboots once an hour? ...replace your processor every 10 minutes?)
- should missions include {energy, economy} ?
- better communicate this as a revolution
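A minimal sketch of reflecting a per-component fault rate up to the system level, in the spirit of the "reboots once an hour?" example above. It assumes independent faults, no mitigation, and that any single component fault takes down the system. The function name and the numbers are hypothetical, chosen only for illustration.

```python
HOURS_PER_FIT_UNIT = 1e9  # 1 FIT = 1 failure per 10^9 component-hours


def system_mtbf_hours(fit_per_component: float, component_count: int) -> float:
    """Mean time between system failures with no mitigation.

    Assumes independent faults and that any one component fault
    fails the whole system, so component rates simply add.
    """
    system_fit = fit_per_component * component_count
    return HOURS_PER_FIT_UNIT / system_fit


# Hypothetical numbers, not measured data.
mtbf = system_mtbf_hours(fit_per_component=1000.0, component_count=1_000_000)
print(f"system MTBF: {mtbf:.2f} hours")
```

With these made-up inputs the unmitigated system fails about once an hour, which is the kind of system-level statement the intro story needs.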
Framing for public: Immuno-Logic
- Immune system analogy: layered defense system
- physical
- innate
- adaptive
- sell story
- create value in US
- competitiveness
- jobs
- protect us
Government/Strategy
- NSF
- talk to Engineering
- come back to NSF with SRC as partner
- SRC
- get quotes from principals at IBM, Intel, Freescale
- potential for new companies (e.g. CISCO)
- customer directed funding
Education
- resilience not currently being taught in EE, CS education
- System Reliability exists as a discipline, but is not part of EE, CS education
- how to educate computer engineers on system reliability?
- K-12 inspiration (FIRST, CubeSATs)
- Competitions (cf. best branch predictor..., DARPA Grand Challenge)
- Added discussion page on Wiki to develop further
Research Organization
- needs cross-cut work; not pushed forward by single-domain projects
- big teams and centers?
- serialization of research not an option (e.g. get accurate model of device effects then do arch/software design)
- data for process reliability is late---never have it before starting design
- standard platforms/models---maybe preference to fund sharable platform work first?
- relation/tie-in to power management infrastructure
- benchmarking/contests (something to fund early)
Life Critical
- standards (IEC 61508)
- technology
- want to see history of technology before adopting
- automotive in production with 130nm (just starting to look at 90nm)
- medical 250nm smallest in production now (looking at 90nm)
- medical may never adopt 45nm
- automotive (ISO 26262)
- requirement is 0 PPM failure (but doesn't say 0 PPB, so sounds like 0.1 PPM acceptable)
- 20 year life
- probability of dangerous failure/hour < 10^-7
- always need more performance
- medical
- highly regulated
- driven by low-power needs --> battery must last 5-10 years
- electronic processing needs/content is growing
- transients lead to power-on-reset
- seeing increase in soft errors
- soft errors now in PPM range (AMD: not sure what that means? needs a time component? per million hours???)
- sometimes physician response to transient is to explant device
- need for ways to demonstrate/quantify resilience
- differences with other constituencies? (things they want to highlight)
- beyond silicon---also analog, passives
- (not a difference, also hear from computer, aerospace, just not focus of this effort -- note DARPA HEALICS)
- need better reliability than current silicon (also not clearly a difference)
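A rough sketch relating the figures in this section (a per-hour failure rate, a 20 year life, and PPM-style requirements), since a PPM number only means something once a time interval is attached. It assumes a constant failure rate and continuous operation over the whole life; both are simplifying assumptions, and the function names are hypothetical.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days of continuous operation (assumption)


def rate_to_fit(rate_per_hour: float) -> float:
    """Convert failures/hour into FIT (failures per 10^9 hours)."""
    return rate_per_hour * 1e9


def failure_probability(rate_per_hour: float, hours: float) -> float:
    """P(at least one failure in the interval) for a constant failure rate."""
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-rate_per_hour * hours)


def ppm_over_life(rate_per_hour: float, years: float) -> float:
    """Expected failed units per million over the stated life."""
    return failure_probability(rate_per_hour, years * HOURS_PER_YEAR) * 1e6


rate = 1e-7  # the per-hour bound noted above
print(f"{rate_to_fit(rate):.0f} FIT")
print(f"{ppm_over_life(rate, years=20):.0f} PPM over 20 years of continuous use")
```

Under these assumptions the per-hour bound alone is far from a "0 PPM over life" reading, so the two requirements presumably differ in failure class or operating hours; the sketch only makes the unit conversion explicit.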
Infrastructure
- physical distribution of sites (access/maintenance, delay in information propagation)
- compute << machine being controlled (cost of compute does not dominate the plant)
- degraded fallback possible (incl. fail off)
- autonomous (power scavenged) monitoring
- availability key metric (short fail-over tolerable)
Roadmap
- still plan to add extrinsic noise model; will shift curves upward (more problematic)
Metrics
- composition calculation must be more than summing FITs (otherwise, it does not account for the benefit of mitigation techniques)
- FIT rate decomposition
- persistent vs. transient
- impact
- detected and corrected (slowdown)
- failure of one application or VM (IBM speak: partition)
- failure of full system
- silent data corruption (SDC)
- availability metrics (2D matrix -- event-rate and time-to-recover at each type/rate)
- standardizing FIT ranges?
- measure resilience of common things [in beam?] as an eye opener (map ethereal into things people might understand)
- e.g. house alarm (fire? intruder?), voting machine, Point-of-Sale terminal?
- step toward consumer reports or standards...
- specmark for reliability
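One possible shape for a composition calculation that goes beyond summing FITs, sketched under assumptions: each component's raw FIT is split across the impact categories listed above, and an availability figure is built from event rate times time-to-recover per category. The class, field names, fractions, and recovery times are all hypothetical, not data from the meeting.

```python
from dataclasses import dataclass

OUTCOMES = ("corrected", "partition_failure", "system_failure", "sdc")


@dataclass
class Component:
    name: str
    raw_fit: float           # raw faults per 10^9 hours, before mitigation
    outcome_fractions: dict  # fraction of raw faults ending in each outcome


def compose(components):
    """Sum FIT per impact category instead of one undifferentiated total."""
    totals = {outcome: 0.0 for outcome in OUTCOMES}
    for c in components:
        if abs(sum(c.outcome_fractions.values()) - 1.0) > 1e-9:
            raise ValueError(f"{c.name}: outcome fractions must sum to 1")
        for outcome, fraction in c.outcome_fractions.items():
            totals[outcome] += c.raw_fit * fraction
    return totals


def unavailability(fit_by_outcome, recovery_hours):
    """Approximate fraction of time lost: sum of event rate x time-to-recover.

    Outcomes with no recovery time (e.g. SDC, which goes unnoticed)
    contribute nothing here and need their own metric.
    """
    return sum(fit * 1e-9 * recovery_hours.get(outcome, 0.0)
               for outcome, fit in fit_by_outcome.items())


# Hypothetical components and numbers, for illustration only.
parts = [
    Component("sram_with_ecc", 1000.0,
              {"corrected": 0.99, "partition_failure": 0.0,
               "system_failure": 0.009, "sdc": 0.001}),
    Component("unprotected_logic", 100.0,
              {"corrected": 0.0, "partition_failure": 0.5,
               "system_failure": 0.3, "sdc": 0.2}),
]
by_outcome = compose(parts)
lost = unavailability(by_outcome, {"partition_failure": 0.1, "system_failure": 1.0})
print(by_outcome)
print(f"unavailability: {lost:.2e}")
```

A plain sum would report 1100 FIT for this made-up system; the decomposition shows that most of it is corrected at the cost of a slowdown, which is the benefit of mitigation that summing hides.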
Addressing Challenges? (orphaned point)
- public health system
Next Steps / Our path forward
- reports from constituency groups by Dec. 1 (?)
- quotes from executives soon (mid-Nov.?)
- workshop summary (this) distributable by end of Nov.
- 2p executive summary during November
- DATE papers (driver for draft of key pieces) by end of Nov.
- full report draft assembled in January? (cleaned up/polished in February?)
- input to SRC December?
- lobbying SRC, DARPA, NSF, others? ...