Characterizing Supercomputer Traffic Networks Through Link-Level Analysis

Abstract

We present techniques for characterizing bandwidth and congestion characteristics of supercomputer High-Speed Networks (HSN). By utilizing a link-level perspective, we gain generality over analyses which are tied to specific topologies. We illustrate these techniques using five months of a Blue Waters production dataset consisting of network utilization and congestion counters. We find that: i) execution time of the communication-heavy applications is highly correlated to network stalls observed in the network topology and increase in application runtime can be as high as 1.7x with nominal increase in stalls, ii) heterogeneity in the available link bandwidth in the network can lead to backpressure and congestion even when the network is not underprovisioned, and (iii) links connected to I/O nodes are no more likely to observe congestion during operational hours than any other link in the system.

Publication
2018 IEEE International Conference on Cluster Computing (HPCMASPA, CLUSTER)