Differences between revisions 6 and 7
Revision 6 as of 2009-11-11 02:31:08
Size: 6781
Editor: AndreDeHon
Comment: updates from telecon
Revision 7 as of 2009-11-16 13:59:21
Size: 6654
Editor: HeatherQuinn
Comment:

[AMD note: where possible, I've tried to pull out themes and order by importance rather than writing a chronological book report]

Summary of Third Meeting (Oct 29--30, 2009, Austin, TX)

Goals

  • This final study meeting had two goals: understanding the constituency groups that had not presented at previous meetings, and crafting a plan for how the results of the study would be written up and presented to the CCC and funding agencies. Presentations from the life-critical systems and infrastructure working groups outlined the key issues facing their communities and some overlaps with other communities. Program managers from the NRL, NSF, and DARPA attended the meeting and provided feedback on how the study's results could be made most useful to them. In particular, a number of participants suggested that the study group propose a multi-agency program to fund cross-layer resilience research, and much of the later discussion focused on ways to pursue this suggestion.

Tuning up Story

  • (some directions to improve intro (10-20 slide) story---important as this is refining the story we'll be telling funders...and may be similar template to some of the documents)
  • more crisply define the goal, including: draw layers -- show an error in logic and how it is handled by higher layers
  • include education
  • need to reflect fault rate up to system level for examples.
    • (e.g., This means your computer reboots once an hour? ...replace your processor every 10 minutes?) (maybe a yield-based example -- NPC)
  • should missions include {energy, economy} ?
  • better communicate this as a revolution
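
One way to reflect a device-level fault rate up to the system level, as the bullets above ask, is to convert FIT (failures per 10^9 device-hours) into a mean time between failures for the whole system. The sketch below is only an illustration: the function names are hypothetical, the FIT values and device counts in the usage lines are made-up placeholders rather than figures from the meeting, and it assumes identical devices that fail independently at a constant rate.

```python
HOURS_PER_FIT_UNIT = 1e9  # 1 FIT = 1 failure per 10^9 device-hours


def system_fit(fit_per_device: float, device_count: int) -> float:
    """Sum per-device FIT over identical, independent devices (assumption)."""
    return fit_per_device * device_count


def mtbf_hours(fit: float) -> float:
    """Mean time between failures, in hours, for a given FIT rate."""
    if fit <= 0:
        raise ValueError("FIT must be positive")
    return HOURS_PER_FIT_UNIT / fit


def describe(fit_per_device: float, device_count: int) -> str:
    """Phrase the system-level rate in everyday units."""
    hours = mtbf_hours(system_fit(fit_per_device, device_count))
    if hours < 1:
        return f"one failure every {hours * 60:.1f} minutes"
    if hours < 24:
        return f"one failure every {hours:.1f} hours"
    return f"one failure every {hours / 24:.1f} days"


# Placeholder inputs, chosen only to make the arithmetic easy to follow.
print(describe(1000.0, 1_000_000))
print(describe(1000.0, 6_000_000))
```

Phrasing the result as "one failure every N hours" is what lets a slide say "your computer reboots once an hour" instead of quoting a raw FIT number.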

Framing for public: Immuno-Logic

  • Immune system analogy: layered defense system
    • physical
    • innate
    • adaptive
  • sell story
    • create value in US
    • competitiveness
    • jobs
    • protect us

Government/Strategy

  • NSF
    • talk to Engineering
    • come back to NSF with SRC as partner
  • SRC (Semiconductor Research Corporation)
    • get quotes from principals at IBM, Intel, Freescale
    • potential for new companies (e.g. Cisco)
    • customer directed funding

Education

  • resilience not currently being taught in EE, CS education
    • System Reliability exists as a discipline, but is not part of EE, CS education
  • how to educate computer engineers on system reliability?
  • K-12 inspiration (FIRST, CubeSATs)
  • Competitions (cf. best branch predictor..., DARPA Grand Challenge)
  • Added discussion page on Wiki to develop further

Research Organization

  • needs cross-cut work; not pushed forward by single-domain projects
  • big teams and centers?
  • serialization of research not an option (e.g. get accurate model of device effects then do arch/software design)
    • data for process reliability is late---never have it before start design
  • standard platforms/models---maybe preference to fund sharable platform work first?
  • relation/tie-in-to power management infrastructure
  • benchmarking/contests (something to fund early)

Life Critical

  • standards (IEC 61508)
  • technology
    • want to see history of technology before adopt
      • deliberately forgo using the most advanced technology so they know more about how it will behave
      • would benefit from more operational data published (available) from existing technologies
    • automotive in production with 130nm (just starting to look at 90nm)
    • medical 250nm smallest in production now (looking at 90nm)
    • medical may never adopt 45nm
  • automotive (ISO 26262)
    • requirement is 0 PPM failure (but doesn't say 0 PPB, so sounds like 0.1 PPM acceptable)
    • 20 year life
    • probability of dangerous failure/hour < 10^-7
    • always need more performance
  • medical
    • highly regulated
    • driven by low-power needs --> battery must last 5-10 years
    • electronic processing needs/content is growing
    • transients lead to power-on-reset
      • seeing increase in soft errors
      • soft errors now in PPM range (AMD: not sure what that means? needs a time component? per million hours???)
      • sometimes physician response to transient is to explant device
  • need for ways to demonstrate/quantify resilience
  • differences with other constituencies? (things they want to highlight)
    • beyond silicon---also analog, passives
      • (not a difference, also hear from computer, aerospace, just not focus of this effort -- note DARPA HEALICS)
    • need better reliability than current silicon (also not clearly a difference)
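
The automotive figures above (a 20 year life and a dangerous-failure probability below 10^-7 per hour) can be related to each other with a short calculation. This is a rough sketch under a constant-hazard-rate assumption; the function names and the operating-hours values are hypothetical choices for illustration, and it does not claim to reproduce how IEC 61508 or ISO 26262 define or apportion their targets.

```python
import math

HOURS_PER_YEAR = 8760  # continuous operation; an upper bound on exposure


def rate_to_fit(rate_per_hour: float) -> float:
    """Express an hourly failure rate in FIT (failures per 10^9 hours)."""
    return rate_per_hour * 1e9


def lifetime_failure_probability(rate_per_hour: float, years: float,
                                 hours_per_year: float = HOURS_PER_YEAR) -> float:
    """P(at least one failure) over the lifetime, assuming a constant rate."""
    exposure_hours = years * hours_per_year
    # 1 - exp(-x), computed with expm1 for accuracy when x is small
    return -math.expm1(-rate_per_hour * exposure_hours)


def failures_per_million(probability: float) -> float:
    """Expected failing units per million deployed units."""
    return probability * 1e6


always_on = lifetime_failure_probability(1e-7, 20)
# 400 operating hours per year is a made-up duty cycle, not a sourced figure.
part_time = lifetime_failure_probability(1e-7, 20, hours_per_year=400)
print(rate_to_fit(1e-7), always_on, part_time, failures_per_million(part_time))
```

Under these assumptions, 10^-7 per hour corresponds to 100 FIT, and continuous operation for 20 years gives a cumulative probability of roughly 1.7%. The result depends heavily on the assumed operating hours, which is one reason a PPM figure needs a stated time base.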

Infrastructure

  • physical distribution of sites (access/maintenance, delay in information propagation)
  • once deployed, seldom removed from service
    • drives heterogeneity in system (many different generations)
    • drives need for flexibility/adaptability (old stuff may never completely be replaced)
  • compute << machine being controlled (cost of compute does not dominate cost of plant)
  • degraded fallback possible (incl. fail off)
  • autonomous (power scavenged) monitoring
  • availability key metric (short fail-over tolerable)

Roadmap

  • still plan to add extrinsic noise model; will shift curves upward (more problematic)

Metrics

  • composition calculation must be more than summing FITs (otherwise, it does not account for the benefit of mitigation techniques)
  • FIT rate decomposition
    • persistent vs. transient
    • impact
      • detected and corrected (slowdown)
      • failure of one application or VM (IBM speak: partition)
      • failure of full system
      • silent data corruption (SDC)
  • availability metrics (2D matrix -- event-rate and time-to-recover at each type/rate)
  • standardizing FIT ranges?
  • measure resilience of common things [in beam?] as an eye opener (map ethereal into things people might understand)
    • e.g. house alarm (fire? intruder?), voting machine, Point-of-Sale terminal?
    • step toward consumer reports or standards...
  • specmark for reliability
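
The decomposition and availability bullets above can be made concrete with a small calculation. The sketch below splits a raw FIT rate across the four impact categories listed and then combines event rate with time-to-recover into an expected downtime fraction. All names, outcome fractions, and recovery times are hypothetical placeholders, and the model assumes rare, independent events; it is one possible reading of the metric, not a method agreed at the meeting.

```python
OUTCOMES = ("corrected", "partition_failure", "system_failure", "sdc")


def decompose(raw_fit: float, outcome_fractions: dict) -> dict:
    """Split a raw FIT rate across outcome categories (fractions must sum to 1)."""
    if set(outcome_fractions) != set(OUTCOMES):
        raise ValueError("need a fraction for every outcome category")
    if abs(sum(outcome_fractions.values()) - 1.0) > 1e-9:
        raise ValueError("outcome fractions must sum to 1")
    return {name: raw_fit * frac for name, frac in outcome_fractions.items()}


def visible_fit(outcome_fit: dict) -> float:
    """FIT left after mitigation: everything not detected and corrected."""
    return sum(fit for name, fit in outcome_fit.items() if name != "corrected")


def downtime_fraction(outcome_fit: dict, recover_hours: dict) -> float:
    """Expected fraction of time spent recovering: sum of rate x time-to-recover.

    Silent data corruption has no recovery time here because it goes
    unnoticed; it still counts toward visible_fit.
    """
    return sum(fit / 1e9 * recover_hours.get(name, 0.0)
               for name, fit in outcome_fit.items())


# Placeholder numbers for illustration only.
split = decompose(1000.0, {"corrected": 0.90, "partition_failure": 0.05,
                           "system_failure": 0.04, "sdc": 0.01})
recovery = {"partition_failure": 0.1, "system_failure": 1.0}
print(visible_fit(split), downtime_fraction(split, recovery))
```

With these placeholder inputs a plain FIT sum reports 1000, while the decomposed view shows about 100 FIT that the user actually experiences, which is the gap the "more than summing FITs" bullet points at.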

Addressing Challenges? (orphaned point)

  • public health system

Next Steps / Our path forward

  • reports from constituency groups by Dec. 1 (?)
  • quotes from executives soon (mid. Nov.?)
  • workshop summary (this) distributable by end of Nov.
  • 2p executive summary during November
  • DATE papers (driver for draft of key pieces) by end of Nov.
  • full report draft assembled in January? (cleaned up/polished in February?)
  • input to SRC December?
  • lobbying SRC, DARPA, NSF, others? ...

Meetings/Third/Summary (last edited 2009-11-24 21:54:45 by AndreDeHon)