Summary of Third Meeting (Oct 29--30, 2009, Austin, TX)

Goals

This final study meeting had two goals: understanding the constituency groups that had not presented at previous meetings, and crafting a plan for how the results of the study would be written up and presented to the CCC and funding agencies. Presentations from the life-critical systems and infrastructure working groups outlined the key issues facing their communities and some overlaps with other communities. Program managers from the NRL, NSF and DARPA attended the meeting, and provided feedback on how the study's results could be made most useful to them. A number of participants suggested that the study group propose a multi-agency program to fund cross-layer resilience research, and much of the later discussion focused on ways to pursue this suggestion.

Tuning up Story

The workshop started in the same manner as the previous two workshops, by telling the cross-layer visioning story. Presenting the 10-20 slide story allows us to provide a basis for first-time attendees and to refine the story to be told to funding agencies. As always, this presentation starts the discussion on reliability challenges and cross-layer reliability approaches. The audience offered a number of suggestions:

They pointed out that we should more crisply define the goals of the research. The suggestion was to show how the errors in logic could be handled by higher layers.
They pointed out that NSF has a strong education focus.
They pointed out that they needed more context on the fault rate so that lay people could understand what the fault rate means. For example, does it mean that one will need to reboot their computer every hour or that one will replace their processor every 10 minutes or that only 1% of hardware devices will yield?
They pointed out that the mission impacts should include economic factors, as well as energy. In this part of the discussion the ability for the U.S. to compete in the global economy came up several times.
They pointed out that all of this work should be communicated as a revolution rather than a revision.

Framing for public: Immuno-Logic

One of the breakout sessions focused on ways to sell this type of project to lay people. The discussion had two objectives: finding slogans that non-technical people would immediately relate to, and working out how to sell the story. The clear winner among the slogans was "Immuno-Logic." Everyone felt that the immune system was a good analogy for cross-layer reliability, as both the human immune system and cross-layer reliability are multi-layer defense systems. Furthermore, lay people have a basic understanding of the immune system and understand how detrimental diseases that directly affect its functionality, such as leukemia and AIDS, are. Most people also understand that the human immune system is both innate and adaptive, two properties that we want computing systems to have. In both cases, the first line of defense is at the physical layer (device reliability for circuits, and physical boundaries that keep pathogens out for the immune system), with additional, usually higher-layer, mechanisms addressing the attacks that get past the physical defenses.

Being able to effectively sell the cross-layer reliability story to funding agencies and Congress is necessary for further research progress on this topic. The discussion here focused on methods of protecting US-based technical companies/jobs and protecting us. The technology industry has for several years felt pressure to outsource technical work to China and India. Many US-based companies outsource technical work to these countries to remain competitive in the global technology economy. The effects of this shift can be seen in both the increase in off-shore fabrication of silicon devices and the increase in off-shore electronics companies. Many participants stated that increasing the reliability of US-designed computing systems would help create value in US-built computing systems, increase the competitiveness of US-based companies, and increase jobs in the US technology market.

There is also a very compelling story to be told in how our computing systems protect us. As stated in later sections, the cost of reliability failures in automobiles, medical-implantable devices, and the energy infrastructure can be quite high. Reliability failures in these arenas can be expensive both in human lives lost and economically. Fairly trivial reliability failures in medical-implantable devices can lead to surgery to have the device explanted and replaced with a new device. In 2003 a cascading failure in the Ohio power infrastructure ended up affecting the entire northeastern US and Canada, which left 55 million people without power, played a role in 11 fatalities, and cost an estimated $6B [http://en.wikipedia.org/wiki/Northeast_Blackout_of_2003, http://www.scientificamerican.com/article.cfm?id=2003-blackout-five-years-later]. For our society to continue to embrace automation in banking, medical devices, automobiles, and infrastructure, the average, non-technical person needs to feel comfortable that the automation increases rather than decreases their overall safety. Finally, we rely heavily on computational support for persistent surveillance for treaty monitoring of both the Comprehensive Test Ban Treaty and environmental treaties, as well as warfighter support for the wars in Iraq and Afghanistan. Reliability failures in these arenas can cause fatalities on the battlefield, lead to bad policy decisions, and create confusion in the geo-political arena [http://en.wikipedia.org/wiki/South_Atlantic_Flash].

Government/Strategy

[HMQ: My notes are weak on this topic and Andre's notes a little cryptic, so I could use Andre's help on this one.]

NSF
- talk to Engineering
- come back to NSF with SRC as partner
SRC (Semiconductor Research Corporation)
- get quotes from principals at IBM, Intel, Freescale
- potential for new companies (e.g. Cisco)
- customer directed funding

Education

The participants also had an open discussion regarding education, as many stated that resilience is not currently being taught in the EE and CS curricula. One participant pointed out that system reliability is taught as a discipline to mechanical and civil engineers, so there is a precedent for teaching these types of ideas to undergraduate engineers. Many people also pointed out that we need to start thinking about how to teach system reliability to computer engineers, including how to work on the K-12 pipeline (e.g. robotics, cubesat projects, and competitions). Several people also felt that there could be competitions tied to conferences, such as the branch predictor competition that was tied to MICRO 37 and 39 [http://www.jilp.org/cbp/, http://cava.cs.utsa.edu/camino/cbp2/]. For continued discussion on this topic, we have added a new wiki page to the relxlayer website to brainstorm educational opportunities [http://www.relxlayer.org/Education].

Research Organization

At this meeting, we introduced a discussion on research organization that NSF brought up when they met with Nick, Heather, and Andre in September. Because the work that needs to be done crosses the entire hierarchy of the computing system, the research needs to be cross-cutting, demanding collaboration across disciplines and teams. This might necessitate big teams or centers to make progress, and it goes against funding models that focus on single-domain projects. Serialization of the research is also not possible, as getting an accurate model of device effects depends on a working architecture/software implementation in the technology. Two areas were discussed as possible near-term funding opportunities -- standard platforms/models and benchmarking -- as progress in these areas will provide the basis for later research.

[Andre': I am not certain what the below bullet is for]

relation/tie-in-to power management infrastructure

Life Critical

The life critical group briefed the workshop for the first time at this meeting. This group gave two briefings -- one from automotive and one from medically-implanted devices. Both of these groups are regulated by the IEC 61508 standard, which is an international standard for "safety-related devices" [http://en.wikipedia.org/wiki/IEC_61508]. The automotive industry is also using a draft standard, ISO 26262, which is an attempt to clarify IEC 61508 as it applies to automotive. Because of the safety concerns, these industries deliberately forgo advanced technology until the larger commercial industry determines how scaling affects the reliability of the technology. They also pointed out that they would benefit from more publicly-available operational data from existing technologies. Currently, automotive technology has 130nm devices in production and medical has 250nm devices. Both industries are starting to look at adopting 90nm technology. The medical industry might never adopt 45nm due to reliability and power concerns. Because of the safety-related concerns, both industries need a way to demonstrate/quantify resilience if they are to move to new reliability methodologies.

The automotive industry demands high-reliability, long-life products. There is a requirement of 0 PPM, although discussion around this point made it sound like 0.1 PPM might be reasonable. [AMD: TODO sentence or two about how 0 is non-sensical...certainly a non-sensical goal.] [HMQ: When you look at the wiki they state that 0 is not the goal, so maybe some of this is interpretation.] The electronics in cars are expected to last the lifetime of the car, which can easily be 20+ years. The probability of dangerous failure per hour must be less than 1E-7. They stated that they always need more performance.
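To give the 1E-7 per-hour figure some context, the following is a rough sketch that converts a constant dangerous-failure rate into the chance of at least one dangerous failure over a vehicle's life. The constant-rate (exponential) model, the function name, and both operating-hours profiles are assumptions made for illustration; they are not numbers presented at the meeting.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days * 24 hours


def lifetime_failure_probability(rate_per_hour, operating_hours):
    """Probability of at least one failure, assuming a constant failure rate."""
    return 1.0 - math.exp(-rate_per_hour * operating_hours)


PFH_LIMIT = 1e-7      # dangerous failures per hour, from the briefing
LIFETIME_YEARS = 20   # expected car lifetime, from the briefing

# Assumption: electronics powered continuously for the whole lifetime (worst case).
always_on_hours = LIFETIME_YEARS * HOURS_PER_YEAR
# Assumption: car driven one hour per day (hypothetical usage profile).
driven_hours = LIFETIME_YEARS * 365 * 1

p_always_on = lifetime_failure_probability(PFH_LIMIT, always_on_hours)
p_driven = lifetime_failure_probability(PFH_LIMIT, driven_hours)

print(f"always on: {p_always_on:.4%}")
print(f"1 h/day:   {p_driven:.4%}")
```

Under these assumptions, a unit that just meets the per-hour limit still has a noticeable lifetime failure probability if it is powered continuously, which is one way to frame why the 0 PPM discussion was contentious.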

The medical industry is primarily driven by low-power needs, as implantable devices must last 5-10 years on the same battery. Much like automobiles, compute processing needs continue to grow. They are starting to see an increase in soft errors in these devices. Soft errors are now in the PPM range. The most common failure mode for these devices is a power-on-reset (POR), as they coerce a large number of errors into a POR. The most common [AMD: a common -- don't think most] [HMQ: Ask Mark.] response to a device that has experienced a POR is to explant the device and replace it with a new device.

Unlike other constituency groups, this group highlighted the need for better reliability than current silicon devices. Like consumer electronics and aerospace, this group is also looking at how analog devices and passives affect the entire system reliability.

Infrastructure

The infrastructure group also briefed for the first time. This group specifically discussed how the physical distribution of the sites affects reliability. As the power grid spans the entire country, access for system maintenance can be delayed by physical distance, and there can be a delay in information propagation in the system. Once systems are deployed they are seldom removed from service, which means that the infrastructure system is extremely heterogeneous and individual computing systems may span several generations of electronics. This heterogeneity necessitates flexible and adaptive reliability solutions that can be adapted to legacy systems that cannot be replaced. Furthermore, the cost of computing does not dominate the system, as the computing systems are much cheaper than the machines they are controlling.

Unlike other constituency groups, the infrastructure group discussed the use of degraded fallback. [HMQ: Later the aerospace group stated how much they liked the degraded fallback and came up with their own slogan of "graceful degradation instead of abject failure."] This group also stated that their standard metric was availability instead of reliability.
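As a rough illustration of why availability rather than reliability is a natural metric for systems that are repaired in place, the sketch below computes steady-state availability from mean time to failure and mean time to repair. The function name and all numbers are hypothetical; they only show how slow maintenance access at a remote site lowers availability even when the failure rate is unchanged.

```python
def steady_state_availability(mttf_hours, mttr_hours):
    """Long-run fraction of time a repairable system is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


MTTF_HOURS = 10_000  # hypothetical mean time to failure, same for both sites

# Hypothetical repair times: a site near a maintenance crew vs. a remote site.
nearby = steady_state_availability(MTTF_HOURS, mttr_hours=4)
remote = steady_state_availability(MTTF_HOURS, mttr_hours=72)

print(f"nearby site: {nearby:.5f}")
print(f"remote site: {remote:.5f}")
```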

[HMQ: I am missing the below point in my notes]

autonomous (power scavenged) monitoring

Roadmap

The roadmap group provided information on their recent progress. They have prepared draft changes to the ITRS. They are still working on adding the extrinsic noise model, which will shift the curves toward less reliable components.

Metrics

The metrics group updated the meeting on their progress. In this briefing they discussed how the composition calculation must be more than summing FITs. If there are mitigated parts, summing would not take into account the benefit of the mitigation techniques. For example, a TMR-protected part would not have a FIT rate based on the parts (i.e., 3X), but a FIT rate based on the system. They suggested decomposing the FIT rate calculation into persistent vs. transient errors and by the impact of the error, such as detected and corrected (slowdown); failure of one application, virtual machine, or partition; full system failure; or silent data corruption (SDC). The infrastructure group advocated that the availability metric be two-dimensional, pairing event rate with time-to-recover for each type of error. They also suggested standardizing FIT metric ranges in a similar fashion to the infrastructure standards.
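The point about TMR can be sketched numerically. The code below contrasts naive FIT summation with a mission-averaged effective FIT for a triplicated part. It is a minimal sketch under stated assumptions: a perfect voter, independent constant-rate failures, no repair, and hypothetical values for the part FIT and mission length. The function names are illustrative and are not the metrics group's method.

```python
import math

FIT_HOURS = 1e9  # one FIT = one failure per 10^9 device-hours


def series_fit(part_fits):
    """Unmitigated parts in series: any failure fails the system, so FITs add."""
    return sum(part_fits)


def tmr_effective_fit(part_fit, mission_hours):
    """Mission-averaged FIT of a TMR group in which 2 of 3 replicas must work.

    Assumes a perfect voter, independent constant-rate failures, and no repair.
    """
    r = math.exp(-part_fit / FIT_HOURS * mission_hours)  # one replica survives
    r_tmr = 3 * r**2 - 2 * r**3                          # at least 2 of 3 survive
    return -math.log(r_tmr) / mission_hours * FIT_HOURS


part_fit = 100.0          # hypothetical FIT of one replica
mission_hours = 5 * 8760  # hypothetical five-year mission

naive = series_fit([part_fit] * 3)
effective = tmr_effective_fit(part_fit, mission_hours)

print(f"naive sum of three replicas: {naive:.1f} FIT")
print(f"effective TMR group rate:    {effective:.2f} FIT")
```

With these assumed numbers the effective rate is far below the rate of even a single replica, whereas summation would report three times the single-part rate. The benefit depends on mission time: without repair, a TMR group's advantage shrinks as the mission grows long relative to the part's mean time to failure.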

The metrics group suggested that we measure the resilience of common electronics, such as house alarms, voting machines, and point-of-sale terminals. They pointed out that the reliability of these everyday objects would be a concrete illustration of an academic topic that lay people might understand. They felt that this would be a step toward consumer reports or standards [HMQ: ?].

Their final suggestion was for a benchmark for reliability.

Addressing Challenges? (orphaned point)

public health system

[HMQ: not certain what to do with this one.]

Next Steps / Our path forward

reports from constituency groups by Dec. 1 (?)
quotes from executives soon (mid. Nov.?)
workshop summary (this) distributable by end of Nov.
2p executive summary during November
DATE papers (driver for draft of key pieces) by end of Nov.
full report draft assembled in January? (cleaned up/polished in February?)
input to SRC December?
lobbying SRC, DARPA, NSF, others? ...