= Final Report Outline =
== (Draft in Progress) ==

Four Pieces:
 1. Public (non-technical) Executive Summary
 2. Technical Executive Summary
 3. Program Vision (main description)
 4. Appendices
  a. sample research solicitation
  a. full roadmap report
  a. full metrics report
  a. constituency group reports

-------------------------------------------------------------------------------------------------
=== Public (non-technical) Executive Summary ===
Target: 2p ['''DONE''' -- see top page]
 * Immuno-Logic

-------------------------------------------------------------------------------------------------
=== Technical Executive Summary ===
Target: 4p ['''DONE''' -- see top page]
 * scaling good, computers everywhere...critical to our lives and economy at many levels
 * challenges ahead due to reliability and power
 * hints there is something we can do
 * need research to develop new solutions
 * here's what we think we can do for you
 * sexy things that additional scaling would give you (might not get otherwise)
 * [added 12/21] challenge round up (see section in DATE vision paper), but digest to one line each for executive summary
 * [added 12/21] big question list
 * [added 12/21] mission impacts (ideally one-liner for each mission)
 * [added 12/21] research organization and research priority recommendations
 * need leadership (government leadership)

-------------------------------------------------------------------------------------------------
=== Program Vision ===
 * What are you trying to do?
  * Allow continued scaling benefits
   * Reduce energy/operation
   * Reduce $$/gate
   * Increase ops/time with limited power-density budget
  * While maintaining or improving safety
  * Navigate inflection points in energy and reliability
 * Why now?
  * inflection points in reliability, energy
  * critical deployment of computation
  * system size?
 * How is it done today?
  * Demand reliable, consistent device operation
   * Margin for worst-case device effect of billions, over multi-year lifetime
   * Discard components when devices fail
  * System-level redundancy
  * The niches where the above is not good enough are small but important (avionics, medical)
   * Spend considerable $$, energy for reliability
    * E.g. brute-force replication
   * Many-year performance lag behind commercial systems
 * Trends?
  * power limited trends? ... gap from margining?
  * voltage scaling (ITRS)
  * decreased dopants --> variability (ITRS)
   * roadmap work on rate of variation-induced defects
  * increasing transistors/chip
  * increasing system sizes (supercomp, data centers)
  * decreased opportunity for burn-in
  * increased wearout effects
  * decreased critical charge --> increased upset susceptibility
   * roadmap work on {intrinsic,extrinsic} upsets?
  * GDP in electronics?
  * electronics in critical systems
  * necessary for nanoscale/nanotech -- (work in disruptive technologies)
  * TODO: connect low-level fault rate to high-level impact (people will see)
   * primarily this is something like: "this fault (variation?) rate will cause this problem in the future"
    * spend X% of time rebooting/recovering?
    * crash after YY minutes of operation?
    * spend Z% on energy overhead?
    * die after Q (too few) days?
   * horror stories (don't necessarily speak to grounding implications of fault rates for lay public---only grounding the impact of failures of electronic systems)
    * http://www.philstar.com/Article.aspx?articleId=525112&publicationSubCategoryId=200
    * http://en.wikipedia.org/wiki/Northeast_Blackout_of_2003
    * http://en.wikipedia.org/wiki/Vela_Incident
    * http://csem.engin.umich.edu/muri/MURIreport2004.pdf
 * What can we accomplish?
  * Build reliable systems from unreliable components
  * Efficiently compensate for unpredictable devices through cooperation at higher levels of the system stack
   * Show stack --- show that errors arise at low levels and are detected/squashed at higher levels
  * Ground goals
   * scale how much further?
   * allow how many more ops/Joule?
   * how close to raw scaling?
   * extend component life by how much?
   * ...more... (depend on / synch with challenges)
 * What's new? (Ideas and promising directions) Ubiquitously/pervasively exploit:
  * Cross-layer codesign --- Multi-level tradeoffs (generalization of hardware/software) (incl. tools to support, system abstractions, algorithm design)
  * Design prepared for self-assessment of safety margins and repair
  * Cooperative filtering of errors at multiple levels
  * Strategic, low-overhead redundancy
  * Differential reliability
  * Scalable and adaptive solutions
 * Why do this?
  * Reliability matters for everything moving forward...just a matter of how much (in the past, for a large class of applications and system sizes, the base technology was reliable enough that there was no need to address reliability in the design; as we move forward, between increasing technology noise and increasing base system sizes, it must be addressed in design for all applications.)
  * Allow scaling to continue without sacrificing safety
   * Continued reduction in energy/op
   * Continued reduction in $$/op
   * Maintain or extend component lifetimes
   * How much further?
  * Allow construction of larger, dependable systems
  * Make infrastructural technology worthy of the trust we place in it
  * Specific big wins (challenges overcome) from focus groups?
   * feasible to fly commercial? and/or have any access to most advanced technology? (close commercial/aerospace component gap?); allow aerospace to exploit modern electronics?
   * advanced technology safe for drive-by-wire?
   * enable larger (more components, computation) medical devices?
   * enable supercomputers able to solve XXX problems?
   * reduce energy used by computers...
   * ??? review focus group output and select appropriate to highlight here ???
  * Can't have security without getting resilience under control
 * Why government leadership?
  * leadership: cross-industry
  * economic
  * safety
  * security
 * Challenge problems and areas of pain
  * Common/cross-cutting challenges
   * Varying demands, workloads, and environment (and uncertainty about the environment) mean worst-case design is overdesign for most uses. This motivates adaptive solutions.
   * Worst-case design independent of the application and its needs is too expensive. Similarly, worst-case design for uncommon, but potentially avoidable, worst-case scenarios is also a large, unnecessary cost. These motivate cross-layer, application-aware solutions and/or models/middleware that support management of the operational aspects of the application.
   * Fully custom/unique construction of all components is not viable (costs, manpower) for anyone. Some domains see more acute versions of this, but no domain is really able to do everything custom themselves these days. This motivates interfaces/metrics/tools to perform composition/analysis/optimization/validation of separately sourced (sub)components.
   * Across the board, there is considerable conservative overdesign. This motivates system assessment methodology and tools that support the energy-delay-area-reliability-thermal-mechanical space.
   * Environment, energy demands, deployed system context, and even technology noise and maturity are all late bound, possibly not known during design, and maybe not known until deployment. This motivates modes and configuration options that allow the component to tune what it spends on reliability. This could allow commercial devices to enhance yield or operate at extremely low energy levels while also making the same parts more usable in larger scale systems or harsher environments.
  * By focus group (5)
   * Commercial
    * Address growing reliability challenge with small enough overhead to avoid negating benefit of scaling
    * Reduce energy per operation while retaining reliable operation
    * Maintain or extend lifetimes in face of increasing wear effects
    * Economically address demand for components with different reliability needs
    * Navigating complex, multidimensional design space
   * Aerospace
    * System lifetimes >> changes in political and scientific need
    * Navigating complex, multidimensional design space
    * Widening gap between commercial and mil/aero components
    * Design for (uncommon) worst-case environment
    * Bottleneck in testing
    * Focus on part reliability over system reliability
   * Large Scale
    * Overhead required to achieve reliability using current and traditional fault-tolerance approaches is too high.
   * Life Critical
   * Infrastructure
 * Big Science Questions?
  * How do we organize, manage, and analyze layering for cooperative fault mitigation?
  * How do we best accommodate repair?
  * What is the right level of filtering at each level of the hierarchy?
  * Can we establish a useful theory and collection of design patterns for lightweight checking?
  * What would a theory and framework for expressing and reasoning about differential reliability look like?
  * Can we develop a scalable theory and architectures that allow adaptation to various upset rates and system reliability targets?
 * Mission Impacts? (perhaps related to stuff trying to summarize above under specific challenges overcome---but, the detail support goes here; this is also the challenges/targets/opportunities for more mission-oriented agencies)
  * security
  * cyberphysical
  * satellite
  * supercomputer
  * green?
  * ???
 * Education
  * what's missing in curriculum just to deal with where we are today (not educating EE/CS types about resilience)
  * what's needed to go with this / how to revolutionize curriculum
 * Critical Questions? (big risk items? ... more strategic questions?)
  * enable concurrent research in understanding low-level upset and fatigue effects with ever-changing technology, along with high-level mitigation
  * manage developer burden (avoid increasing)
   * must be careful not to add complexity that makes things worse (e.g. pushing more complexity into software without adequate validation that the software will handle it appropriately)
  * how to deal with legacy software (don't want to take this as an absolute mandate that inhibits innovation, but there should be some thinking about how to handle things without a complete rewrite)
  * ??? others
 * Metrics, Goals, Measure and manage programs
  * Goals of metrics
   * assess whether proposed research attacks the right problems
   * measure whether research is making progress on solving the problem
  * Some possible, primary metrics
   * Energy/Op at noise rate and performance target (Noise rate: defects, variation, wear, transients)
   * Post-fab adaptability to range of noise rates
   * Timeliness and quality of adaptation
  * Recommendations from metrics group
 * Research Organization and Infrastructure (see discussion starter ResearchOrgInfra)
 * Examples and Illustrative Scenarios
  * Processor
  * SoC
  * High-level software
 * Layer discussion and community contributions: probably good to have a clear description of the layers and call out the wide range of communities that could participate in this research
 * Process and Participants
  * summarize activities of group (meetings, wiki, focus groups...)
  * comprehensive list of participants
 * Related efforts
  * Germany (Nassif mentioned)
  * Japan (Carter have pointer?)
  * UK ??

-------------------------------------------------------------------------------------------------
=== Appendices ===
==== a: Research Solicitation ====
Target: 2p (''This should be a stand-alone piece.
While any actual research solicitation would be assembled by the respective program managers, putting together a draft like this is a good discipline for making sure that the details we provide here are adequate to support such a solicitation.'')
 * outline key challenge ahead
 * kind of research solicited
 * program structure (? program coverage of promising directions?)
 * evaluation criteria

==== b: Roadmap Report ====
==== c: Metrics Report ====
==== d: Constituency Group Report ====
 1. [[attachment:Challenges/consumer_challenge.pdf|Consumer Electronics]]
 2. [[attachment:Challenges/aerospace_challenge.pdf|Aerospace]]
 3. [[attachment:Challenges/largescale_challenge.pdf|Large-Scale Systems]]
 4. [[attachment:Challenges/lifecritical_challenge.pdf|Life Critical]]
 5. [[Challenges/Infrastructure|Infrastructure]]