Summary of Third Meeting (Oct 29--30, 2009, Austin, TX)

Goals

This final study meeting had two goals: understanding the constituency groups that had not presented at previous meetings, and crafting a plan for how the results of the study would be written up and presented to the CCC and funding agencies. Presentations from the life-critical systems and infrastructure working groups outlined the key issues facing their communities and some overlaps with other communities. Program managers from the NRL, NSF, and DARPA attended the meeting and provided feedback on how the study's results could be made most useful to them. In particular, a number of participants suggested that the study group propose a multi-agency program to fund cross-layer resilience research, and much of the later discussion focused on ways to pursue this suggestion.

Tuning up Story

We started the workshop, as we did the other two, by telling the cross-layer visioning story for the series of workshops. Presenting the 10-20 slide story gives first-time attendees a common basis and lets us refine the story we will be telling funding agencies. As always, this presentation started the discussion on reliability challenges and cross-layer reliability approaches. The audience provided a number of suggestions:

Framing for public: Immuno-Logic

One of the breakout sessions focused on ways to sell this type of project to lay people. This discussion focused on two different objectives: finding slogans that non-technical people would immediately jive to [HMQ: yes, I know I need a better word here.] and how to sell the story. The clear winning slogan concept was "Immunologic." Everyone felt that the immune system analogy was a good concept for cross-layer reliability, as both the human immune system and cross-layer reliability are multi-layer defense systems. Furthermore, lay people have a basic understanding of the immune system and of how detrimental diseases that directly impair it, such as leukemia and AIDS, can be. Most people also understand that the human immune system is both innate and adaptive, two properties that we want computing systems to have. Finally, there is a synergy between the human immune system and cross-layer reliability due to the physical nature of both systems. [HMQ: I wasn't certain what to do with the physical sub-bullet.]

There was also further discussion of how to sell the cross-layer reliability story. Many of the points brought up here concerned protecting US-based jobs and protecting us. The technology industry has for several years felt pressure to outsource technical work to China and India. Many companies feel that outsourcing technical work to these countries is necessary to remain price-competitive in the global technology economy. This can be seen in both the increase in off-shore fabrication of silicon and the increase in off-shore electronics companies. Many people felt that increasing the reliability of US-based computing systems would help create value in US-built computing systems, increase the competitiveness of US-based companies, and increase jobs in the US technology market.

There is also a very strong story to be told in how our computing systems protect us. As stated in later sections, the cost of reliability failures in automobiles, implantable medical devices, and the energy infrastructure can be quite high. Reliability failures in these arenas can be expensive both in human lives lost and economically. Fairly trivial reliability failures in implantable medical devices can lead to surgery to have the device explanted and replaced with a new device. In 2003 a cascading failure in the Ohio power infrastructure ended up affecting the entire northeastern US and Canada, which left 55 million people without power, played a role in 11 fatalities, and cost $6B [http://en.wikipedia.org/wiki/Northeast_Blackout_of_2003, http://www.scientificamerican.com/article.cfm?id=2003-blackout-five-years-later]. Finally, we rely heavily on computational support for persistent surveillance for treaty monitoring of both the Comprehensive Test Ban Treaty and environmental treaties, as well as warfighter support for the wars in Iraq and Afghanistan. Reliability failures in these arenas can cause fatalities on the battlefield, lead to bad policy decisions, and create confusion in the geo-political arena [http://en.wikipedia.org/wiki/South_Atlantic_Flash].

Government/Strategy

[HMQ: My notes are weak on this topic and Andre's notes a little cryptic, so I could use Andre's help on this one.]

Education

There was a spirited discussion on education, as many people felt that resilience is not currently being taught in the EE and CS curricula. One participant pointed out that system reliability is taught as a discipline to mechanical and civil engineers, so there is a precedent for teaching these types of ideas to engineers. Many people also pointed out that we need to start thinking about how to teach system reliability to computer engineers, including how to work on the K-12 pipeline with robotics and cubesat projects and competitions. Several people also felt that there could be competitions tied to conferences, as the branch predictor competition was tied to HPCA [HMQ: ?]. We have also added a new wiki page to the relxlayer website to foster continued discussion regarding educational opportunities [http://www.relxlayer.org/Education].

Research Organization

At this meeting, we introduced a discussion on research organization that NSF brought up in our September meeting. Because the work that needs to be done crosses the entire hierarchy of the computing system, the research must be cross-cutting, demanding collaboration across disciplines and teams. This might necessitate big teams or centers to make progress, and it goes against funding models that focus on single-domain projects. Serializing the research is also not possible, as getting an accurate model of device effects depends on a working architecture/software implementation in the technology. Two areas were discussed as possible near-term funding opportunities -- standard platforms/models and benchmarking -- as progress in these areas could provide the basis for later research.

[Andre': I am not certain what the below bullet is for]

Life Critical

The life-critical group briefed the meeting for the first time. This group gave two briefings: one from automotive and one from implantable medical devices. Both of these industries are regulated by the IEC 61508 standard, which is an international standard for "safety-related devices" [http://en.wikipedia.org/wiki/IEC_61508]. The automotive industry is also using a draft standard, ISO 26262. Because of the safety concerns, these industries deliberately forgo advanced technology until the larger commercial industry determines how scaling affects the reliability of the technology. They also pointed out that they would benefit from more publicly available operational data from existing technologies. Currently, automotive technology has 130nm devices in production and medical has 250nm devices. Both industries are starting to look at adopting 90nm technology. The medical industry might never adopt 45nm due to reliability and power concerns. Because of the safety-related concerns, both industries need a way to demonstrate and quantify resilience.

The automotive industry demands high-reliability, long-life products. There is a requirement of 0 PPM, although discussion around this point made it sound like 0.1 PPM might be reasonable. The electronics in cars are expected to last the lifetime of the car, which can easily be 20+ years. The probability of dangerous failure/hour must be less than 10E-7. They stated that they always need more performance.
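To give a feel for how a per-hour failure bound relates to a long service life, the following is a minimal sketch under stated assumptions, not a calculation presented at the meeting. It assumes a constant failure rate (exponential model), reads the stated bound as 1e-7 per hour (an assumption about the notation), and uses a hypothetical 10,000 operating hours over the life of a car. The function name is illustrative.

```python
import math


def lifetime_failure_probability(rate_per_hour: float, operating_hours: float) -> float:
    """Probability of at least one failure under a constant-rate model.

    Computes 1 - exp(-rate * hours); expm1 keeps precision for small products.
    """
    if rate_per_hour < 0 or operating_hours < 0:
        raise ValueError("rate and hours must be non-negative")
    return -math.expm1(-rate_per_hour * operating_hours)


# Both values below are assumptions made for illustration only.
ASSUMED_RATE_PER_HOUR = 1e-7      # one reading of the stated per-hour bound
ASSUMED_OPERATING_HOURS = 10_000  # hypothetical lifetime operating hours

p = lifetime_failure_probability(ASSUMED_RATE_PER_HOUR, ASSUMED_OPERATING_HOURS)
print(f"lifetime probability of a dangerous failure: {p:.6f}")
```

For small products of rate and hours, the result is close to the product itself, which is why a per-hour bound has to be very small when the service life is measured in decades.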

The medical industry is primarily driven by low-power needs, as implantable devices must last 5-10 years on the same battery. Much like in automobiles, processing needs continue to grow. The industry is starting to see an increase in soft errors in these devices; soft errors are now in the PPM range. The most common failure mode for these devices is a power-on-reset (POR), which seems to be a generic categorization for several types of errors. The most common response to a device that has experienced a POR is to explant the device and replace it with a new device.

Unlike other constituency groups, this group highlighted the need for better reliability than current silicon devices provide. Like consumer electronics and aerospace, this group is also looking at how analog devices and passives affect the reliability of the entire system.

Infrastructure

The infrastructure group also briefed the meeting for the first time. Unlike many other groups that have briefed these workshops, this group specifically discussed how the physical distribution of the sites affects reliability. Because the power grid spans the entire country, access for maintenance can be delayed by physical distance, and information can be slow to propagate through the system. Once systems are deployed they are seldom removed from service, which means that the infrastructure is extremely heterogeneous and individual computing systems may span several generations of electronics. This heterogeneity necessitates flexible and adaptive reliability solutions that can be adapted to legacy systems that cannot be replaced. Furthermore, the cost of computing does not dominate the system, as the computing systems are much cheaper than the machines being controlled.

Unlike other constituency groups, the infrastructure group discussed the use of degraded fallback. [HMQ: Later the aerospace group stated how much they liked the degraded fallback and came up with their own slogan of "graceful degradation instead of abject failure."] This group also stated that their standard metric was availability instead of reliability.
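The distinction between availability and reliability can be made concrete with the standard steady-state availability ratio, MTTF / (MTTF + MTTR). The sketch below is illustrative only: the function name and all numbers are hypothetical, not figures presented at the meeting. It shows how a longer repair delay, such as the delay physical distance can add to maintenance, lowers availability even when the hardware itself is equally reliable.

```python
def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a repairable system is up: MTTF / (MTTF + MTTR)."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical numbers: identical hardware (same MTTF), different repair delays.
nearby_site = steady_state_availability(mttf_hours=10_000, mttr_hours=4)
remote_site = steady_state_availability(mttf_hours=10_000, mttr_hours=72)
print(f"nearby: {nearby_site:.4%}, remote: {remote_site:.4%}")
```

Under this model, shortening time to repair or falling back to a degraded mode instead of going down can improve availability without changing how often the underlying components fail.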

[HMQ: I am missing the below point in my notes]

Roadmap

Metrics

Addressing Challenges? (orphaned point)

Next Steps / Our path forward