= Infrastructure Challenges =


== Nature ==

Our modern infrastructure is highly computerized.  This infrastructure includes our power grid, our building control (heating and cooling, fire suppression, security), and our telecommunications (phone, internet, cable).  All of these are directly or indirectly life critical: communications are necessary for emergency response, power is required to run emergency and hospital equipment as well as to keep our environments livable, and much of building control is associated with keeping us safe.  Because of the additional capabilities and economics they provide, the trend over time has been to increase the computational components of these systems.  Automation responds more quickly and consistently than humans and makes our systems run more efficiently.  All this means we must provide high reliability for an ever-growing computerized system.

Because of our increasing dependence on computational and communication infrastructure (networking, computing, and cloud computing, the combination of the network and the compute servers), outages also have a large, negative economic impact.  Many modern workplaces grind to a halt when the network is out, resulting in large costs (''e.g.'', consider the professional salaries of the affected staff for the period of the outage or, alternatively, the lost sales and reputation due to the outage).  In cases where the computation controls a larger physical plant (''e.g.'' power, heating, cooling), failures of the computation to provide appropriate control could endanger the controlled plant (''e.g.'' allow a power line to overload or a chemical reaction to proceed out of control).

Infrastructure systems tend to be highly distributed.  In many cases, their spatial distribution is essential to the services they provide: we must get power out to a large area, communication is about connecting distant people and machines, and building control must reach into all spaces in a building.  This means computations cannot be centralized in a carefully controlled environment and are less physically accessible.  It also means that system upgrades do not occur uniformly and the system as a whole will almost always be composed of many different generations of technology.

Computation in some of these infrastructure roles (''e.g.'' power, heating, cooling) is relatively inexpensive compared to the plant the computation is monitoring or controlling.  As a result, this class of system has been able to tolerate larger overhead costs for reliability (''e.g.'' if the computing is only 1% of the cost of the system, duplicating or triplicating it may only increase the system costs by 1--2%). 
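
The arithmetic behind that estimate can be sketched as follows.  This is a minimal illustration, assuming each extra copy of the computing costs the same as the original and ignoring voters, wiring, and other integration overhead; the function name `added_system_cost` is hypothetical.

```python
def added_system_cost(compute_fraction: float, copies: int) -> float:
    """Fractional increase in total system cost from replicating the compute.

    compute_fraction: share of the original system cost spent on computing (0..1).
    copies: total number of compute copies after replication (1 = no redundancy).
    Assumption: every extra copy costs the same as the original, and
    integration overhead (voters, wiring, software) is ignored.
    """
    if not 0.0 <= compute_fraction <= 1.0:
        raise ValueError("compute_fraction must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return compute_fraction * (copies - 1)


# Computing is 1% of system cost, as in the example in the text.
for label, copies in (("duplicated", 2), ("triplicated", 3)):
    print(f"{label}: +{added_system_cost(0.01, copies):.1%} system cost")
```

The same function shows why this approach does not carry over to systems where computing dominates: with a compute fraction of 0.5, triplication doubles the system cost.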

Availability is a key metric.  What is the fraction of down time?  Short service failures (a few milliseconds of network outage, a few seconds of heating or cooling control) may be tolerable, so infrastructure systems care both about the frequency of upsets and the time to recover.  Long outage events must be very infrequent, whereas quick recovery events can occur at higher frequencies.
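
The trade-off between upset frequency and recovery time can be made concrete with a small sketch.  It assumes outages do not overlap and that every event of a given kind takes the same time to recover; the function names and the example numbers are illustrative, not taken from any particular system.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def downtime_fraction(events_per_year: float, recovery_seconds: float) -> float:
    """Fraction of the year spent down, assuming non-overlapping outages."""
    return events_per_year * recovery_seconds / SECONDS_PER_YEAR


def availability(events_per_year: float, recovery_seconds: float) -> float:
    """Fraction of the year the service is up."""
    return 1.0 - downtime_fraction(events_per_year, recovery_seconds)


def max_events_per_year(downtime_budget: float, recovery_seconds: float) -> float:
    """How many events per year fit inside a given downtime fraction."""
    return downtime_budget * SECONDS_PER_YEAR / recovery_seconds


# Two very different failure profiles with the same total downtime:
frequent_short = downtime_fraction(events_per_year=1000, recovery_seconds=0.010)
rare_long = downtime_fraction(events_per_year=1, recovery_seconds=10.0)
print(f"1000 upsets of 10 ms: downtime fraction {frequent_short:.2e}")
print(f"1 outage of 10 s:     downtime fraction {rare_long:.2e}")
```

Under a fixed downtime budget, the tolerable event rate scales inversely with recovery time, which is why fast recovery lets a system absorb far more frequent upsets.  Note that downtime fraction alone does not capture whether an individual outage is short enough to be tolerable.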

Increasing efficiency demands greater computational control---either to control more things or find solutions closer to the optimum.  This increases computational needs, but not excessively.  Many ''Green'' initiatives to reduce energy consumption rely on more sophisticated computation and monitoring to control energy usage.

Much of the economics of the computing infrastructure (perhaps more so in the context of networking and telecommunications) comes from riding the mainstream technology wave.  So, while computing needs in some infrastructural areas (perhaps power and building control) might be satisfied with a freeze at 180 nm technology, this now places a premium cost on maintaining access to older technology, one that will make the electronic parts even more expensive.  The coupling and volume benefits between industries remain strong.


== Challenges ==


 * Affordably increase availability for increasingly large and distributed infrastructure systems.
   * Availability cannot decrease.
   * With increasing system sizes, this suggests an increasing reliability demand on each component.
   * While computation is often not the dominant cost in these systems, the cost for the computation cannot increase significantly.
  
  Like life critical and aerospace systems, the availability and reliability demands for infrastructure are often higher than those for consumer components.  At the same time, these applications cannot afford completely unique components for their use.  This suggests these systems may benefit from adaptable systems where they can use standard components and configure them to provide higher reliability levels when serving in this role.
  
 * Infrastructure cannot afford to develop custom components for all of its applications and systems, nor to maintain processes and technologies distinct from those of other market segments.

    This situation motivates the design of components and systems with modes and configuration options that allow higher layers in the system to tune what each component spends on reliability based on its market segment.

 * While compute costs do not currently dominate, it nonetheless remains true that providing complete, guaranteed '''never-fail''' service is prohibitively expensive.
  
   This suggests a need to increase availability by:
   * providing degraded modes of operation that continue to provide some level of availability when failures occur.
   * providing affordable monitoring of the distributed infrastructure to allow early warnings of problems and to rapidly diagnose failures and expedite repairs.  Affordability in some cases will mean extremely low power for the monitors, suggesting a sensitivity to monitoring power requirements.
   * supporting remote reconfiguration and repair to rapidly restore some level of service.


 * Human service costs must decrease despite increasing component complexity, increasing system size, and increasing distribution of components.  Advanced technology, which may see earlier wear-out, exacerbates this challenge.

   This suggests a greater need for automated and remote repair and adaptation.  It also suggests a need for adaptation to integrate, accommodate, and optimize systems composed of many heterogeneous technology generations.

 * Analog sensors, discrete components, and passive components are not reliable enough.

    This suggests the need for mitigation at higher levels in the system stack for a truly reliable solution.  This, in turn, suggests a need for standard interfaces to increase the observability,  diagnosability, and control of the sensors.
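
To illustrate the first challenge in this list, the sketch below shows how the per-component availability requirement tightens as a system grows.  It assumes the simplest possible model: the system is up only when every component is up, and components fail independently.  Real infrastructure includes redundancy and degraded modes, so these numbers are a pessimistic bound; the function names and the 0.999 target are hypothetical.

```python
def system_availability(component_availability: float, n_components: int) -> float:
    """Availability of a system that needs all n independent components up."""
    return component_availability ** n_components


def required_component_availability(system_target: float, n_components: int) -> float:
    """Per-component availability needed to hold the system at system_target."""
    return system_target ** (1.0 / n_components)


# Hold system availability constant while the system grows (assumed target).
SYSTEM_TARGET = 0.999
for n in (10, 100, 1000):
    per_component = required_component_availability(SYSTEM_TARGET, n)
    print(f"{n:5d} components: each may be down at most "
          f"{1.0 - per_component:.2e} of the time")
```

In this model the allowed downtime per component shrinks roughly in proportion to the component count, which is the sense in which larger systems place an increasing reliability demand on each component.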