David Fiala
David Fiala, Frank Mueller, Kurt B. Ferreira, Christian Engelmann. 30th ACM International Conference on Supercomputing (ICS) (Istanbul, Turkey, June 2016). Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing (PDF)
David Fiala¹, Kurt B. Ferreira², Frank Mueller, and Christian Engelmann³. ¹ Department of Computer Science, North Carolina State University f dfiala | fmuelle [email protected]
David Fiala, Frank Mueller North Carolina State University Raleigh, NC fdfiala,[email protected] Christian Engelmann Oak Ridge Natl Lab Oak Ridge, TN [email protected] Rolf Riesen IBM Ireland Dublin, Ireland [email protected] Kurt Ferreira, Ron Brightwell Sandia Natl Labs Albuquerque, NM fkbferre,[email protected]
[10] Kurt Ferreira, Kevin Pedretti, Patrick Bridges, Ron Brightwell, David Fiala, and Frank Mueller. Evaluating operating system vulnerability to memory errors. In Proceedings of the International Workshop on Runtime and Operating Systems for Supercomputers, 2012. [11] David Fiala, Kurt B. Ferreira, Frank Mueller, and Christian Engelmann.
David Fiala (NCSU), Kurt Ferreira (SNL), Frank Mueller (NCSU), Christian Engelmann (ORNL). Motivation: Silent Data Corruption (SDC): undetected soft errors that result in corruption in storage (processor, cache, disks, RAM, etc.). SDC faults may manifest themselves as bit-flips in memory
Combining Partial Redundancy and Checkpointing for HPC James Elliott∗, Kishor Kharbas∗, David Fiala∗, Frank Mueller∗, Kurt Ferreira† and Christian Engelmann‡ ∗ North Carolina State University, Raleigh, NC, [email protected] † Scalable System Software, Sandia National Laboratories Albuquerque, NM, [email protected] ‡ Computer Science and Mathematics Division, Oak Ridge ...
David Fiala and Frank Mueller Dept. of Computer Science North Carolina State University Email: [email protected] Kurt B. Ferreira Center for Computing Research Sandia National Laboratories Email: [email protected] Abstract—Proposed exascale systems will present considerable challenges. In particular, DRAM soft-errors, or bit-flips,
Exploiting Content Similarity to Improve Memory Performance in Exascale Systems. Scott Levy¹, Kurt B. Ferreira², Patrick G. Bridges, Dorian Arnold, and David Fiala³. ¹ Department of Computer Science, University of New Mexico; ² Scalable System Software, Sandia National Laboratories; ³ Department of Computer Science, North Carolina State University. 1 Background & Motivation
David Fiala. Advisor: Frank Mueller (NCSU). Collaborators: Christian Engelmann (ORNL), Rolf Riesen, Kurt Ferreira (SNL). MOTIVATION: Design and implementation of efficient mechanisms for fault tolerance in HPC. Propose efficient protocols for SDC protection. Investigate the cost of different levels of redundancy
David Fiala, Frank Mueller NCSU Raleigh, NC fdfiala,[email protected] Christian Engelmann Oak Ridge Natl Lab Oak Ridge, TN [email protected] Kurt Ferreira, Ron Brightwell Sandia Natl Labs Albuquerque, NM [email protected] Rolf Riesen IBM Dublin, Ireland [email protected] Abstract—Faults have become the norm rather than the