Summary of Third Meeting (Oct 29--30, 2009, Austin, TX)

[AMD note: where possible, I've tried to pull out themes and order them by importance rather than writing a chronological book report]

Goals

  • paragraph of framing for this meeting

Tuning up Story

  • (some directions to improve the intro (10-20 slide) story---important as this is refining the story we'll be telling funders...and may be a similar template for some of the documents)
  • define the goal more crisply, including: draw the layers -- show an error in logic and how it is handled by higher layers
  • include education
  • need to reflect fault rate up to system level for examples.
    • (e.g., This means your computer reboots once an hour? ...replace your processor every 10 minutes?)
  • should missions include {energy, economy} ?
  • better communicate this as a revolution
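
One way to produce the system-level examples asked for above ("your computer reboots once an hour?") is to scale a per-device fault rate by the number of devices. The sketch below is illustrative only: the function name and the numbers are hypothetical, and it assumes independent faults at a constant rate with no mitigation, so that per-device rates simply add. It uses the standard definition of 1 FIT as one failure per 10^9 device-hours.

```python
HOURS_PER_FIT_UNIT = 1e9  # 1 FIT = 1 failure per 10^9 device-hours

def system_mtbf_hours(fit_per_device: float, device_count: int) -> float:
    """Mean time between failures for a system of identical devices.

    Assumes independent, constant-rate faults and no mitigation,
    so the per-device FIT rates add up to the system FIT rate.
    """
    system_fit = fit_per_device * device_count
    return HOURS_PER_FIT_UNIT / system_fit

# Hypothetical numbers, chosen only to illustrate the scaling.
mtbf = system_mtbf_hours(fit_per_device=1000.0, device_count=1_000_000)
print(f"{mtbf:.1f} hours between failures")
```

With these made-up inputs the unmitigated system would fail about once an hour, which is the kind of statement the story slides could use.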

Framing for public: Immuno-Logic

  • Immune system analogy: layered defense system
    • physical
    • innate
    • adaptive
  • sell story
    • create value in US
    • competitiveness
    • jobs
    • protect us

Government/Strategy

  • NSF
    • talk to Engineering
    • come back to NSF with SRC as partner
  • SRC
    • get quotes from principals at IBM, Intel, Freescale
    • potential for new companies (e.g. Cisco)
    • customer directed funding

Education

  • resilience not currently being taught in EE, CS education
    • System Reliability exists as a discipline, but is not part of EE, CS education
  • how to educate computer engineers on system reliability?
  • K-12 inspiration (FIRST, CubeSATs)
  • Competitions (cf. best branch predictor..., DARPA Grand Challenge)
  • Added discussion page on Wiki to develop further

Research Organization

  • needs cross-cut work; not pushed forward by single-domain projects
  • big teams and centers?
  • serialization of research not an option (e.g. get accurate model of device effects then do arch/software design)
    • data for process reliability is late---never have it before design starts
  • standard platforms/models---maybe preference to fund sharable platform work first?
  • relation/tie-in-to power management infrastructure
  • benchmarking/contests (something to fund early)

Life Critical

  • standards (IEC 61508)
  • technology
    • want to see history of technology before adopting
    • automotive in production with 130nm (just starting to look at 90nm)
    • medical 250nm smallest in production now (looking at 90nm)
    • medical may never adopt 45nm
  • automotive (ISO 26262)
    • requirement is 0 PPM failure (but doesn't say 0 PPB, so sounds like 0.1 PPM acceptable)
    • 20 year life
    • probability of dangerous failure/hour < 10^{-7}
    • always need more performance
  • medical
    • highly regulated
    • driven by low-power needs --> battery must last 5-10 years
    • electronic processing needs/content is growing
    • transients lead to power-on-reset
      • seeing increase in soft errors
      • soft errors now in PPM range (AMD: not sure what that means? needs a time component? per million hours???)
      • sometimes physician response to transient is to explant device
  • need for ways to demonstrate/quantify resilience
  • differences with other constituencies? (things they want to highlight)
    • beyond silicon---also analog, passives
      • (not a difference, also hear from computer, aerospace, just not focus of this effort -- note DARPA HEALICS)
    • need better reliability than current silicon (also not clearly a difference)
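
To relate the automotive figures above (probability of dangerous failure/hour < 10^{-7}, 20 year life), here is a rough worked calculation. It is a sketch under stated assumptions, not something presented at the meeting: it assumes a constant failure rate (exponential model), and the default duty cycle of 1.0 assumes continuous operation, which overstates exposure for a vehicle that is mostly parked. The function names are hypothetical. It also shows why a bare "PPM" figure needs a time base: a rate per hour converts directly to FIT (failures per 10^9 hours).

```python
import math

HOURS_PER_YEAR = 8760  # 365 days x 24 hours

def rate_to_fit(rate_per_hour: float) -> float:
    """Convert a failure rate per hour to FIT (failures per 1e9 hours)."""
    return rate_per_hour * 1e9

def lifetime_failure_probability(rate_per_hour: float,
                                 life_years: float,
                                 duty_cycle: float = 1.0) -> float:
    """Probability of at least one failure over the service life.

    Assumes a constant failure rate; duty_cycle is the fraction of
    the service life during which the device is actually operating.
    """
    operating_hours = life_years * HOURS_PER_YEAR * duty_cycle
    return 1.0 - math.exp(-rate_per_hour * operating_hours)

# Figures from the notes: 1e-7 dangerous failures/hour, 20 year life.
p = lifetime_failure_probability(1e-7, 20)
print(f"{rate_to_fit(1e-7):.0f} FIT, lifetime probability {p:.4f}")
```

Under these assumptions the 10^{-7} per hour bound corresponds to about 100 FIT and a lifetime probability of a dangerous failure of roughly 1.7% per unit if it operated continuously.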

Infrastructure

  • physical distribution of sites (access/maintenance, delay in information propagation)
  • compute << machine being controlled (cost of compute does not dominate plant cost)
  • degraded fallback possible (incl. fail off)
  • autonomous (power scavenged) monitoring
  • availability key metric (short fail-over tolerable)

Roadmap

  • still plan to add extrinsic noise model; will shift curves upward (more problematic)

Metrics

  • composition calculation must be more than summing FITs (otherwise it does not account for the benefit of mitigation techniques)
  • decompose FIT rate by:
    • persistent vs. transient
    • impact
      • detected and corrected (slowdown)
      • failure of one application or VM (IBM speak: partition)
      • failure of full system
      • silent data corruption (SDC)
  • availability metrics (2D matrix -- event-rate and time-to-recover at each type/rate)
  • standardizing FIT ranges?
  • measure resilience of common things [in beam?] as an eye opener (map ethereal into things people might understand)
    • e.g. house alarm (fire? intruder?), voting machine, Point-of-Sale terminal?
    • step toward consumer reports or standards...
  • specmark for reliability
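
The FIT decomposition and availability points above can be sketched as a small calculation. This is an illustration only: the component names, raw FIT values, and impact fractions are hypothetical, and the meeting did not fix any particular model. Each component's raw FIT is split across the impact classes from the notes, so faults that are detected and corrected are counted separately from visible failures instead of being added into one total. The availability helper uses the common approximation that downtime is the sum of event rate times time-to-recover, which holds only when the result is much smaller than 1.

```python
from dataclasses import dataclass

# Impact classes taken from the notes; "partition" is one application or VM.
IMPACTS = ("corrected", "partition_failure", "system_failure", "sdc")

@dataclass
class Component:
    name: str
    raw_fit: float          # raw fault rate in FIT (failures per 1e9 hours)
    impact_fraction: dict   # share of raw faults ending in each impact class

def fit_by_impact(components):
    """Split each component's raw FIT across impact classes, summed per class."""
    totals = {impact: 0.0 for impact in IMPACTS}
    for comp in components:
        for impact in IMPACTS:
            totals[impact] += comp.raw_fit * comp.impact_fraction.get(impact, 0.0)
    return totals

def unavailability(event_classes):
    """Approximate fraction of time down.

    event_classes is a list of (events_per_hour, hours_to_recover) pairs,
    i.e. one row of the event-rate / time-to-recover matrix per event type.
    """
    return sum(rate * recover for rate, recover in event_classes)

# Hypothetical components and fractions, for illustration only.
components = [
    Component("ecc_memory", 1000.0,
              {"corrected": 0.99, "system_failure": 0.009, "sdc": 0.001}),
    Component("unprotected_logic", 100.0,
              {"partition_failure": 0.5, "system_failure": 0.3, "sdc": 0.2}),
]

totals = fit_by_impact(components)
naive_sum = sum(c.raw_fit for c in components)       # plain sum of raw FITs
visible_failures = naive_sum - totals["corrected"]   # what mitigation leaves
print(totals)
print(naive_sum, visible_failures)
```

In this made-up example the plain sum is 1100 FIT, while only about 110 FIT reaches the user as a partition failure, system failure, or SDC, which is the gap a summing-only composition would hide.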

Addressing Challenges? (orphaned point)

  • public health system

Next Steps / Our path forward

  • reports from constituency groups by Dec. 1 (?)
  • quotes from executives soon (mid-Nov.?)
  • workshop summary (this) distributable by end of Nov.
  • 2p executive summary during November
  • DATE papers (driver for draft of key pieces) by end of Nov.
  • full report draft assembled in January? (cleaned up/polished in February?)
  • input to SRC December?
  • lobbying SRC, DARPA, NSF, others? ...

Meetings/Third/Summary (last edited 2009-11-24 21:54:45 by AndreDeHon)