Differences between revisions 16 and 17

Summary of Third Meeting (Oct 29--30, 2009, Austin, TX)

Goals

This final study meeting had two goals: understanding the constituency groups that had not presented at previous meetings, and crafting a plan for how the results of the study would be written up and presented to the CCC and funding agencies. Presentations from the life-critical systems and infrastructure working groups outlined the key issues facing their communities and some overlaps with other communities. Program managers from the NRL, NSF and DARPA attended the meeting and provided feedback on how the study's results could be made most useful to them. In particular, a number of participants suggested that the study group propose a multi-agency program to fund cross-layer resilience research, and much of the later discussion focused on ways to pursue this suggestion.

Tuning up Story

We started the workshop, as we did the other two, by telling the cross-layer visioning story for the workshop series. Presenting the 10-20 slide story gives first-time attendees a common basis and lets us refine the story we'll be telling funding agencies. As always, this presentation opened the discussion on reliability challenges and cross-layer reliability approaches. The audience offered a number of suggestions:

They felt that we should more crisply define the goal of the research. This included showing how the errors in the logic could be handled by higher layers.
We were also reminded that NSF has a strong education focus.
They felt that we needed to put the fault rate in context so that people could understand what it means in the larger picture (e.g., does this mean your computer reboots once an hour? ...that you replace your processor every 10 minutes?) (maybe a yield-based example -- NPC)
There was also a lengthy discussion of whether the mission impacts should only be energy or should include economic factors. The ability of the U.S. to compete in the global economy came up several times.
It was also stated that all of this work should be communicated as a revolution rather than a revision.

Framing for public: Immuno-Logic

One of the breakout sessions focused on ways to sell this type of project to lay people. The discussion had two objectives: finding slogans that non-technical people would immediately jive to [HMQ: yes, I know I need a better word here.] and working out how to sell the story. The clear winner among the slogan concepts was "Immunologic." Everyone felt that the immune system was a good analogy for cross-layer reliability, as both the human immune system and cross-layer reliability are multi-layer defense systems. Furthermore, lay people have a basic understanding of the immune system and of how detrimental diseases that directly impair it, such as leukemia and AIDS, can be. Most people also understand that the human immune system is both innate and adaptive, two properties that we want computing systems to have. Finally, there is a synergy between the human immune system and cross-layer reliability due to the physical nature of both systems. [HMQ: I wasn't certain what to do with the physical sub-bullet.]

There was also further discussion of how to sell the cross-layer reliability story. Many of the points raised concerned protecting US-based jobs and protecting us. The technology industry has for several years felt pressure to outsource technical work to China and India. Many companies feel that outsourcing technical work to these countries is necessary to remain price-competitive in the global technology economy. This can be seen in both the increase in off-shore fabrication of silicon and the increase in off-shore electronics companies. Many people felt that increasing the reliability of US-based computing systems would help create value in US-built computing systems, increase the competitiveness of US-based companies, and increase jobs in the US technology market.

There is also a very strong story to be told in how our computing systems protect us. As stated in later sections, the cost of reliability failures in automobiles, implantable medical devices, and the energy infrastructure can be quite high. Reliability failures in these arenas can be costly both in human lives and economically. Fairly trivial reliability failures in implantable medical devices can lead to surgery to have the device explanted and replaced with a new device. In 2003 a cascading failure in the Ohio power infrastructure ended up affecting the entire northeastern US and Canada, which left 55 million people without power, played a role in 11 fatalities, and cost $6B [http://en.wikipedia.org/wiki/Northeast_Blackout_of_2003, http://www.scientificamerican.com/article.cfm?id=2003-blackout-five-years-later]. Finally, we rely heavily on computational support for persistent surveillance for treaty monitoring of both the comprehensive test ban treaty and environmental treaties, as well as warfighter support for the wars in Iraq and Afghanistan. Reliability failures in these arenas can cause fatalities on the battlefield, lead to bad policy decisions, and create confusion in the geo-political arena [http://en.wikipedia.org/wiki/South_Atlantic_Flash].

Government/Strategy

[HMQ: My notes are weak on this topic and Andre's notes a little cryptic, so I could use Andre's help on this one.]

NSF
- talk to Engineering
- come back to NSF with SRC as partner
SRC (Semiconductor Research Corporation)
- get quotes from principals at IBM, Intel, Freescale
- potential for new companies (e.g. Cisco)
- customer directed funding

Education

There was a spirited discussion on education, as many people felt that resilience is not currently being taught in the EE and CS curricula. One participant pointed out that system reliability is taught as a discipline to mechanical and civil engineers, so there is a precedent for teaching these types of ideas to engineers. Many people also pointed out that we needed to start thinking about how to teach system reliability to computer engineers, including how to work on the K-12 pipeline with robotics and CubeSat projects and competitions. Several people also felt that there could be competitions tied to conferences, as the branch predictor competition was tied to HPCA [HMQ: ?]. We have also added a new wiki page to the relxlayer website to foster continued discussion regarding educational opportunities [http://www.relxlayer.org/Education].

Research Organization

needs cross-cut work; not pushed forward by single-domain projects
big teams and centers?
serialization of research not an option (e.g. get accurate model of device effects then do arch/software design)
- data for process reliability is late---never have it before start design
standard platforms/models---maybe preference to fund sharable platform work first?
relation/tie-in-to power management infrastructure
benchmarking/contests (something to fund early)

Life Critical

standards (IEC 61508)
technology
- want to see history of technology before adopt
  - deliberately forgo using most advanced technology so know more about how it will behave
  - would benefit from more operational data published (available) from existing technologies
- automotive in production with 130nm (just starting to look at 90nm)
- medical 250nm smallest in production now (looking at 90nm)
- medical may never adopt 45nm
automotive (ISO 26262)
- requirement is 0 PPM failure (but doesn't say 0 PPB, so sounds like 0.1 PPM acceptable)
- 20 year life
- probability of dangerous failure/hour < 10^-7
- always need more performance
medical
- highly regulated
- driven by low-power needs --> battery must last 5-10 years
- electronic processing needs/content is growing
- transients lead to power-on-reset
  - seeing increase in soft errors
  - soft errors now in PPM range (AMD: not sure what that means? needs a time component? per million hours???)
  - sometimes physician response to transient is to explant device
need for ways to demonstrate/quantify resilience
differences with other constituencies? (things they want to highlight)
- beyond silicon---also analog, passives
  - (not a difference, also hear from computer, aerospace, just not focus of this effort -- note DARPA HEALICS)
- need better reliability than current silicon (also not clearly a difference)
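
The automotive figures in the notes above (probability of dangerous failure per hour below 10^-7, 20 year life) and the open question of what a PPM figure means without a time component can be related with a small calculation. The sketch below is illustrative only: it assumes a constant failure rate (an exponential lifetime model), and the function names and the duty_cycle parameter are hypothetical, not taken from IEC 61508 or ISO 26262.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days; leap days ignored


def rate_per_hour_to_fit(rate_per_hour: float) -> float:
    """Convert a failure rate per hour to FIT (failures per 1e9 device-hours)."""
    return rate_per_hour * 1e9


def prob_failure_over_life(rate_per_hour: float, years: float,
                           duty_cycle: float = 1.0) -> float:
    """Probability of at least one failure over the service life.

    Assumes a constant failure rate. duty_cycle is the (hypothetical)
    fraction of calendar time the device is actually operating.
    """
    operating_hours = years * HOURS_PER_YEAR * duty_cycle
    return 1.0 - math.exp(-rate_per_hour * operating_hours)


def to_ppm(probability: float) -> float:
    """Express a per-unit failure probability as parts per million."""
    return probability * 1e6


if __name__ == "__main__":
    rate = 1e-7  # upper bound quoted in the notes, per hour
    print("FIT:", rate_per_hour_to_fit(rate))
    p_life = prob_failure_over_life(rate, years=20)
    print("P(failure over 20 years, always on):", p_life)
    print("Equivalent PPM over that life:", to_ppm(p_life))
```

Under these assumptions, a rate of 10^-7 per hour corresponds to about 100 FIT, and for a device that is powered continuously for 20 years the chance of at least one such failure is about 1.7%. A PPM figure only becomes comparable to a per-hour rate once the exposure time (and duty cycle) behind it is stated.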

Infrastructure

physical distribution of sites (access/maintenance, delay in information propagation)
once deployed, seldom removed from service
- drives heterogeneity in system (many different generations)
- drives need for flexibility/adaptability (old stuff may never completely be replaced)
compute << machine being controlled (cost of compute does not dominate cost of plant)
degraded fallback possible (incl. fail off)
autonomous (power scavenged) monitoring
availability key metric (short fail-over tolerable)

Roadmap

still plan to add extrinsic noise model; will shift curves upward (more problematic)

Metrics

composition calculation must be more than summing FITs (otherwise, does not account for benefit of mitigation techniques)
FIT rate decomposition
- persistent vs. transient
- impact
  - detected and corrected (slowdown)
  - failure of one application or VM (IBM speak: partition)
  - failure of full system
  - silent data corruption (SDC)
availability metrics (2D matrix -- event-rate and time-to-recover at each type/rate)
standardizing FIT ranges?
measure resilience of common things [in beam?] as an eye opener (map ethereal into things people might understand)
- e.g. house alarm (fire? intruder?), voting machine, Point-of-Sale terminal?
- step toward consumer reports or standards...
specmark for reliability
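
As one way to picture the decomposition noted above, the following sketch composes per-component FIT rates by impact category rather than summing raw FITs, and then derives a first-order availability estimate from event rate and time-to-recover. All names, fractions, and numbers are hypothetical illustrations, not data from the meeting. Representing the 2D availability matrix as one recovery time per impact category is a simplifying assumption.

```python
from dataclasses import dataclass

# Impact categories from the notes: detected and corrected, failure of one
# application or VM, failure of full system, silent data corruption (SDC).
OUTCOMES = ("corrected", "app_failure", "system_failure", "sdc")


@dataclass
class Component:
    name: str
    raw_fit: float          # raw fault rate in FIT (faults per 1e9 hours)
    outcome_fraction: dict  # fraction of raw faults ending in each outcome


def compose(components):
    """Return system-level FIT per impact category.

    A plain sum of raw_fit would ignore mitigation. Here mitigation shows up
    as a large 'corrected' fraction for the protected component.
    """
    totals = {outcome: 0.0 for outcome in OUTCOMES}
    for c in components:
        if abs(sum(c.outcome_fraction.values()) - 1.0) > 1e-9:
            raise ValueError(f"outcome fractions for {c.name} must sum to 1")
        for outcome in OUTCOMES:
            totals[outcome] += c.raw_fit * c.outcome_fraction.get(outcome, 0.0)
    return totals


def availability(fit_by_outcome, recovery_hours):
    """First-order availability: 1 - sum(event rate * time-to-recover).

    Reasonable only when total downtime is a small fraction of the time.
    """
    downtime_fraction = sum(
        (fit / 1e9) * recovery_hours.get(outcome, 0.0)
        for outcome, fit in fit_by_outcome.items()
    )
    return 1.0 - downtime_fraction


# Hypothetical two-component system: ECC-protected memory and unprotected logic.
memory = Component("memory+ECC", 1000.0,
                   {"corrected": 0.99, "app_failure": 0.0,
                    "system_failure": 0.01, "sdc": 0.0})
logic = Component("logic", 100.0,
                  {"corrected": 0.0, "app_failure": 0.5,
                   "system_failure": 0.2, "sdc": 0.3})

totals = compose([memory, logic])
naive_sum = memory.raw_fit + logic.raw_fit
# Hypothetical recovery times in hours; SDC is undetected, so no downtime.
avail = availability(totals, {"corrected": 1e-6, "app_failure": 0.1,
                              "system_failure": 1.0, "sdc": 0.0})
```

In this made-up example the naive sum is dominated by the memory, yet most of those faults are corrected with only a slowdown, while the much smaller logic contribution accounts for all of the silent data corruption. A single summed FIT number hides exactly that distinction.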

Addressing Challenges? (orphaned point)

public health system

Next Steps / Our path forward

reports from constituency groups by Dec. 1 (?)
quotes from executives soon (mid. Nov.?)
workshop summary (this) distributable by end of Nov.
2p executive summary during November
DATE papers (driver for draft of key pieces) by end of Nov.
full report draft assembled in January? (cleaned up/polished in February?)
input to SRC December?
lobbying SRC, DARPA, NSF, others? ...

-  ⇤ ← Revision 16 as of 2009-11-19 01:11:50 → 
  Size: 10347
  Editor: HeatherQuinn
  Comment:
+   ← Revision 17 as of 2009-11-19 01:37:12 → ⇥
  Size: 10935
  Editor: HeatherQuinn
  Comment: