Differences between revisions 11 and 12

Summary of Third Meeting (Oct 29--30, 2009, Austin, TX)

Goals

This final study meeting had two goals: understanding the constituency groups that had not presented at previous meetings, and crafting a plan for how the results of the study would be written up and presented to the CCC and funding agencies. Presentations from the life-critical systems and infrastructure working groups outlined the key issues facing their communities and some overlaps with other communities. Program managers from the NRL, NSF and DARPA attended the meeting and provided feedback on how the study's results could be made most useful to them. In particular, a number of participants suggested that the study group propose a multi-agency program to fund cross-layer resilience research, and much of the later discussion focused on ways to pursue this suggestion.

Tuning up Story

As in the two earlier workshops, we opened by telling the cross-layer visioning story for the workshop series. Presenting the 10-20 slide story gives first-time attendees a common basis and lets us refine the story we will be telling funding agencies. As always, the presentation started the discussion of reliability challenges and cross-layer reliability approaches. The audience offered a number of suggestions:
- They felt that we should more crisply define the goal of the research. This included showing how the errors in the logic could be handled by higher layers.
- We were also reminded that NSF has a strong education focus.
- They felt that we needed to provide more context to the fault rate so that people could understand what that means in the large picture. (e.g., This means your computer reboots once an hour? ...replace your processor every 10 minutes?) (maybe a yield-based example -- NPC)
- There was also a lengthy discussion of whether the mission impacts should only be energy or should include economic factors. Discussion regarding the ability for the U.S. to compete in the global economy came up several times.
- It was also stated that all of this work should be communicated as a revolution rather than a revision.
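
The request for more context on the fault rate can be illustrated with a small conversion from FIT (failures per 10^9 device-hours) to an everyday interval. This is only a sketch: the function names and the input numbers are hypothetical, chosen to show the arithmetic, and are not figures from the study.

```python
FIT_DEVICE_HOURS = 1e9  # by definition, 1 FIT = 1 failure per 10^9 device-hours


def mtbf_hours(fit_per_device: float, device_count: int = 1) -> float:
    """Mean time between failures (hours) across `device_count` identical devices."""
    if fit_per_device <= 0 or device_count <= 0:
        raise ValueError("FIT rate and device count must be positive")
    return FIT_DEVICE_HOURS / (fit_per_device * device_count)


def in_everyday_terms(hours: float) -> str:
    """Phrase an interval in the largest unit that keeps the number >= 1."""
    if hours >= 8760:
        return f"about one failure every {hours / 8760:.1f} years"
    if hours >= 24:
        return f"about one failure every {hours / 24:.1f} days"
    if hours >= 1:
        return f"about one failure every {hours:.1f} hours"
    return f"about one failure every {hours * 60:.1f} minutes"


# Hypothetical (FIT per device, number of devices) pairs, not measured data.
for fit, count in [(1_000, 1), (1_000, 1_000), (100_000, 100_000)]:
    print(fit, count, in_everyday_terms(mtbf_hours(fit, count)))
```

The same per-device rate reads very differently for a single part and for a large installation, which is the kind of context the audience asked for.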

Framing for public: Immuno-Logic

Immune system analogy: layered defense system
- physical
- innate
- adaptive
sell story
- create value in US
- competitiveness
- jobs
- protect us

Government/Strategy

NSF
- talk to Engineering
- come back to NSF with SRC as partner
SRC (Semiconductor Research Corporation)
- get quotes from principals at IBM, Intel, Freescale
- potential for new companies (e.g. CISCO)
- customer directed funding

Education

resilience not currently being taught in EE, CS education
- System Reliability exists as a discipline, but is not part of EE, CS education
how to educate computer engineers on system reliability?
K-12 inspiration (FIRST, CubeSATs)
Competitions (cf. best branch predictor..., DARPA Grand Challenge)
Added discussion page on Wiki to develop further

Research Organization

needs cross-cut work; not pushed forward by single-domain projects
big teams and centers?
serialization of research not an option (e.g. get accurate model of device effects then do arch/software design)
- data for process reliability is late---never have it before design starts
standard platforms/models---maybe preference to fund sharable platform work first?
relation/tie-in to power management infrastructure
benchmarking/contests (something to fund early)

Life Critical

standards (IEC 61508)
technology
- want to see history of technology before adopting it
  - deliberately forgo using most advanced technology so they know more about how it will behave
  - would benefit from more operational data published (available) from existing technologies
- automotive in production with 130nm (just starting to look at 90nm)
- medical 250nm smallest in production now (looking at 90nm)
- medical may never adopt 45nm
automotive (ISO 26262)
- requirement is 0 PPM failure (but doesn't say 0 PPB, so sounds like 0.1 PPM acceptable)
- 20 year life
- probability of dangerous failure/hour < 10^-7
- always need more performance
medical
- highly regulated
- driven by low-power needs --> battery must last 5-10 years
- electronic processing needs/content is growing
- transients lead to power-on-reset
  - seeing increase in soft errors
  - soft errors now in PPM range (AMD: not sure what that means? needs a time component? per million hours???)
  - sometimes physician response to transient is to explant device
need for ways to demonstrate/quantify resilience
differences with other constituencies? (things they want to highlight)
- beyond silicon---also analog, passives
  - (not a difference, also hear from computer, aerospace, just not focus of this effort -- note DARPA HEALICS)
- need better reliability than current silicon (also not clearly a difference)
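
To put the automotive figures above (probability of dangerous failure per hour < 10^-7, 20 year life) on one scale, the sketch below converts the hourly rate to FIT and to a probability of at least one failure over the service life. It assumes a constant failure rate (exponential model) and, by default, continuous operation. Both are simplifying assumptions introduced here, and the `duty_cycle` parameter is hypothetical.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days * 24 hours, ignoring leap days


def rate_to_fit(failures_per_hour: float) -> float:
    """Convert a per-hour failure rate to FIT (failures per 10^9 hours)."""
    return failures_per_hour * 1e9


def lifetime_failure_probability(failures_per_hour: float,
                                 years: float,
                                 duty_cycle: float = 1.0) -> float:
    """P(at least one failure) over the life, assuming a constant rate.

    duty_cycle is the assumed fraction of time the part is operating.
    """
    operating_hours = years * HOURS_PER_YEAR * duty_cycle
    # 1 - exp(-lambda * t), written with expm1 for accuracy at small values
    return -math.expm1(-failures_per_hour * operating_hours)


limit = 1e-7  # dangerous-failure rate bound quoted in the notes
print(rate_to_fit(limit))                       # the bound expressed in FIT
print(lifetime_failure_probability(limit, 20))  # over 20 years, always on
```

Under these assumptions the 10^-7 per hour bound is equivalent to 100 FIT, and a part running continuously at exactly that bound has roughly a 1.7% chance of a dangerous failure over 20 years.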

Infrastructure

physical distribution of sites (access/maintenance, delay in information propagation)
once deployed, seldom removed from service
- drives heterogeneity in system (many different generations)
- drives need for flexibility/adaptability (old stuff may never completely be replaced)
compute << machine being controlled (cost of compute does not dominate plant)
degraded fallback possible (incl. fail off)
autonomous (power scavenged) monitoring
availability key metric (short fail-over tolerable)

Roadmap

still plan to add extrinsic noise model; will shift curves upward (more problematic)

Metrics

composition calculation must be more than summing FITs (otherwise, it does not account for the benefit of mitigation techniques)
FIT rate decomposition
- persistent vs. transient
- impact
  - detected and corrected (slowdown)
  - failure of one application or VM (IBM speak: partition)
  - failure of full system
  - silent data corruption (SDC)
availability metrics (2D matrix -- event-rate and time-to-recover at each type/rate)
standardizing FIT ranges?
measure resilience of common things [in beam?] as an eye opener (map ethereal into things people might understand)
- e.g. house alarm (fire? intruder?), voting machine, Point-of-Sale terminal?
- step toward consumer reports or standards...
specmark for reliability
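
The points above on composition, FIT decomposition and availability can be sketched as follows. This is an illustration under stated assumptions, not a method agreed at the meeting: each failure mode carries a raw FIT rate, a mitigation coverage (fraction detected and corrected), an impact class for the uncovered remainder, and a time to recover. All names and numbers are hypothetical, and the availability formula is a common steady-state approximation for independent modes.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    raw_fit: float         # unmitigated failures per 10^9 hours
    persistent: bool       # True = persistent, False = transient
    coverage: float        # fraction detected and corrected (0..1)
    uncovered_impact: str  # "app", "system", or "sdc"
    recover_hours: float   # time to recover from an uncovered event


def decompose(modes):
    """Split the raw FIT total by impact class instead of just summing it."""
    out = {"corrected": 0.0, "app": 0.0, "system": 0.0, "sdc": 0.0}
    for m in modes:
        out["corrected"] += m.raw_fit * m.coverage
        out[m.uncovered_impact] += m.raw_fit * (1.0 - m.coverage)
    return out


def persistent_vs_transient(modes):
    """Raw FIT split by persistence."""
    persistent = sum(m.raw_fit for m in modes if m.persistent)
    transient = sum(m.raw_fit for m in modes if not m.persistent)
    return persistent, transient


def availability(modes):
    """Steady-state availability from event rate x time-to-recover.

    Silent data corruption causes no detected outage, so it is excluded here.
    """
    downtime_ratio = sum(
        m.raw_fit * (1.0 - m.coverage) / 1e9 * m.recover_hours
        for m in modes if m.uncovered_impact != "sdc"
    )
    return 1.0 / (1.0 + downtime_ratio)


# Hypothetical example data, for illustration only.
example_modes = [
    FailureMode("sram_soft_error", 1000.0, False, 0.99, "sdc", 0.0),
    FailureMode("logic_soft_error", 100.0, False, 0.0, "app", 0.1),
    FailureMode("wearout", 50.0, True, 0.5, "system", 24.0),
]
print(decompose(example_modes))
print(persistent_vs_transient(example_modes))
print(availability(example_modes))
```

In this made-up example a plain sum gives 1150 FIT, while the decomposition shows that most of it is detected and corrected and only a small remainder appears as application failure, system failure, or SDC.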

Addressing Challenges? (orphaned point)

public health system

Next Steps / Our path forward

reports from constituency groups by Dec. 1 (?)
quotes from executives soon (mid. Nov.?)
workshop summary (this) distributable by end of Nov.
2p executive summary during November
DATE papers (driver for draft of key pieces) by end of Nov.
full report draft assembled in January? (cleaned up/polished in February?)
input to SRC December?
lobbying SRC, DARPA, NSF, others? ...

Revision 11 as of 2009-11-17 20:06:21 (Size: 7347, Editor: HeatherQuinn)
Revision 12 as of 2009-11-17 20:06:51 (Size: 7360, Editor: HeatherQuinn)