globaljournals.org
[PDF] Silent Data Errors in GPUs: Challenges and Mitigation in Modern ...
Excerpt
Modern Silicon. By Sameeksha Gupta. Abstract: Silent data errors (SDEs) in graphics processing units represent a critical challenge for modern computational systems that rely on these accelerators in high-performance computing, artificial intelligence, and data center operations. These errors propagate through calculations without triggering detection mechanisms, potentially compromising results in critical applications …
… Variations | Timing violations | Higher impact in large dies | Critical timing paths, marginal circuits
Thermal Stress | Electromigration acceleration | Exacerbated by workload variation | Interconnect structures, package interfaces
Voltage Fluctuations | Timing margin violations | Worsens with efficiency
Related Pain Points
Silent data errors in GPU computations
Silent data errors (SDEs) in GPUs propagate through calculations without triggering detection mechanisms, potentially compromising results in critical applications. These errors stem from timing violations, thermal stress, electromigration, and voltage fluctuations on modern silicon.
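To make "without triggering detection mechanisms" concrete, here is a minimal sketch of how an application-level check could catch a corrupted result that the hardware reports as successful. It uses a column-checksum identity for matrix products (the general idea behind algorithm-based fault tolerance). This is an illustrative assumption, not a method taken from the excerpt. The function names are hypothetical, and the code runs in plain Python on the CPU purely to show the arithmetic.

```python
# Illustrative sketch only: checksum-based detection of a corrupted matrix product.
# All function names are hypothetical; nothing here is taken from the cited paper.

def matmul(a, b):
    """Plain product of two matrices given as lists of rows."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

def column_sums(m):
    """Sum each column of m, returning one value per column."""
    return [sum(col) for col in zip(*m)]

def checksum_ok(a, b, c):
    """Check the identity colsum(A @ B) == colsum(A) @ B against a result c."""
    expected = matmul([column_sums(a)], b)[0]
    return column_sums(c) == expected

a = [[1, 2], [3, 4], [5, 6]]
b = [[7, 8, 9], [10, 11, 12]]

c = matmul(a, b)
clean_ok = checksum_ok(a, b, c)

# Simulate a silent error: flip one bit in one output element.
# No exception is raised; the result simply becomes wrong.
corrupted = [row[:] for row in c]
corrupted[1][2] ^= 1 << 4
corrupted_ok = checksum_ok(a, b, corrupted)

print(clean_ok, corrupted_ok)  # True False
```

Integer inputs keep the comparison exact. With floating-point data the check would need a tolerance, and choosing that tolerance is itself a design decision.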
Manufacturing defects and silicon variations in GPUs
Manufacturing defects and silicon imperfections account for 13% of GPU failures in AI clusters, typically manifesting early in operational life. These stem from timing variations, thermal stress, and electromigration acceleration during high-utilization deep learning workloads.