Publications

(2021). Data-driven Application-oriented Reliability Model of a High-Performance Computing System. IEEE Transactions on Reliability.

(2021). Delay Sensitivity-driven Congestion Mitigation for HPC Systems. Proceedings of the 34th ACM International Conference on Supercomputing (ICS'21).

(2020). BayesPerf: Minimizing Performance Monitoring Errors Using Bayesian Statistics. The 26th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'21).

(2020). FIRM: An Intelligent Fine-grained Resource Management Framework for SLO-oriented Microservices. Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI'20).

(2020). AV-FUZZER: Finding safety violations in autonomous driving systems. Proceedings of the IEEE International Symposium on Software Reliability Engineering (ISSRE'20).

Best Paper

(2020). Live Forensics for HPC Systems: A Case Study on Distributed Storage Systems. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2020).

Best Student Paper and Best Paper Nomination

(2020). Inductive Bias-driven Reinforcement Learning For Efficient Schedules in Heterogeneous Clusters. Thirty-seventh International Conference on Machine Learning (ICML 2020).

(2020). ML-driven Malware that Targets AV Safety. 2020 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).

(2020). Modeling Communication Latency in High-speed Interconnection Networks. 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20).

(2020). Measuring Congestion in High-Performance Datacenter Interconnects. 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20).

(2020). The Mystery of the Failing Jobs: Insights from Operational Data from Two University-Wide Computing Systems. 2020 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).

(2019). Holistic Measurement driven System Assessment. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis.

(2019). A Study of Network Congestion in Two Supercomputing High-Speed Interconnects. 2019 IEEE 26th Annual Symposium on High-Performance Interconnects (HOTI).

(2019). ML-Based Fault Injection for Autonomous Vehicles: A Case for Bayesian Fault Injection. 2019 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).

(2019). Understanding Fault Scenarios and Impacts through Fault Injection Experiments in Cielo. arXiv e-prints.

(2019). Monet - Blue Waters Network Dataset. University of Illinois at Urbana-Champaign.

(2018). Resiliency of HPC Interconnects: A Case Study of Interconnect Failures and Recovery in Blue Waters. IEEE Transactions on Dependable and Secure Computing.

(2018). Hands Off the Wheel in Autonomous Vehicles?: A Systems Perspective on over a Million Miles of Field Data. 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).

(2018). AVFI: Fault Injection for Autonomous Vehicles. 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W).

(2018). Fault Injection-based System to Assess the Safety and Reliability of Autonomous Vehicles to Faults and Errors. IEEE International Workshop on Automotive Reliability and Test.

(2017). Holistic Measurement-Driven System Assessment. 2017 IEEE International Conference on Cluster Computing (CLUSTER).

(2016). Analysis of Gemini Interconnect Recovery Mechanisms: Methods and Observations. Cray User Group.

(2015). Improving Main Memory Hash Joins on Intel Xeon Phi Processors: An Experimental Approach. Proc. VLDB Endowment.

(2014). BbmTTP: Beat-based Parallel Simulated Annealing Algorithm on GPGPUs for the Mirrored Traveling Tournament Problem. Proceedings of the High Performance Computing Symposium.

(2013). Resiliency for Extreme Scale Systems. CSL Student Conference 2016.

Best Poster

(2013). P-HGRMS: A Parallel Hypergraph Based Root Mean Square Algorithm for Image Denoising. High Performance Parallel and Distributed Computing.

Best Poster

(2013). Exploiting Data Parallelism in the YConvex Hypergraph Algorithm for Image Representation Using GPGPUs. Proceedings of the 27th ACM International Conference on Supercomputing.
