Publications
Publications by category, in reverse chronological order.
2024
- Real-time Digital RF Emulation – II: A Near Memory Custom Accelerator. X. Mao, M. Mukherjee, N. Mizanur Rahman, C. DeLude, J. Driscoll, and 14 more authors. IEEE Transactions on Radar Systems, 2024.
A near memory hardware accelerator, based on a novel direct path computational model, for real-time emulation of radio frequency systems is demonstrated. Our evaluation of hardware performance uses both application-specific integrated circuit (ASIC) and field programmable gate array (FPGA) methodologies: (1) The ASIC test-chip implementation, using TSMC 28nm CMOS, leverages distributed autonomous control to extract concurrency in compute as well as low latency. It achieves a per-channel bandwidth of 518 MHz in a prototype 4-node system. The maximum emulation range supported in this paradigm is 9.5 km with 0.24 μs of per-sample emulation latency. (2) The FPGA-based implementation, evaluated on a Xilinx ZCU104 board, demonstrates a 9-node test case (two transmitters, one receiver, and six passive reflectors) with an emulation range of 1.13 km to 27.3 km at 215 MHz bandwidth.
- Real-Time Digital RF Emulation—Part I: The Direct Path Computational Model. C. DeLude, J. Driscoll, M. Mukherjee, N. Rahman, X. Mao, and 9 more authors. IEEE Transactions on Radar Systems, 2024.
In this article, we consider the problem of developing a computational model for emulating an RF channel. The motivation for this is that an accurate and scalable emulator has the potential to minimize the need for field testing, which is expensive, slow, and difficult to replicate. Traditionally, emulators are built using a tapped delay line (TDL) model where long filters modeling the physical interactions of objects are implemented directly. For an emulation scenario consisting of M objects all interacting with one another, the TDL model’s computational requirements scale as O(M^3) per sample: there are O(M^2) channels, each with O(M) complexity. In this article, we develop a new “direct path” model that, while remaining physically faithful, allows us to carefully factor the emulator operations, resulting in an O(M^2) per-sample scaling of the computational requirements. The impact of this is drastic: a 200-object scenario sees about a 100× reduction in the number of per-sample computations. Furthermore, the direct path model gives us a natural way to distribute the computations for an emulation: each object is mapped to a computational node, and these nodes are networked in a fully connected communication graph. Alongside a discussion of the model and the physical phenomena it emulates, we show how to efficiently parameterize antenna responses and scattering profiles within this direct path framework. To verify the model and demonstrate its viability in hardware, we provide several numerical experiments produced using a cycle-level C++ simulator of a hardware implementation of the model.
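A back-of-the-envelope sketch of the scaling argument, assuming unit constant factors for both models (with the paper's actual constants, the measured reduction for 200 objects is about 100×):

```cpp
#include <cstdio>

// Illustrative per-sample operation counts for an M-object scenario,
// using only the asymptotic scalings from the abstract: O(M^3) for the
// tapped delay line (TDL) model vs. O(M^2) for the direct path model.
int main() {
    for (long M : {50, 100, 200}) {
        long tdl    = M * M * M; // O(M^2) channels, each with O(M) taps
        long direct = M * M;     // one interaction per object pair
        std::printf("M=%4ld  TDL=%10ld  direct=%8ld  ratio=%ldx\n",
                    M, tdl, direct, tdl / direct);
    }
    // Under unit constants the ratio at M=200 is 200x; the paper's
    // reported ~100x reflects the models' real constant factors.
}
```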
- Improving Program Debloating with 1-DU Chain Minimality. Myeongsoo Kim, Santosh Pande, and Alessandro Orso. In Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings, Lisbon, Portugal, 2024.
Modern software often struggles with bloat, leading to increased memory consumption and security vulnerabilities from unused code. In response, various program debloating techniques have been developed, typically utilizing test cases that represent the functionalities users want to retain. These methods range from aggressive approaches, which prioritize maximal code reduction but may overfit to test cases and potentially reintroduce past security issues, to conservative strategies that aim to preserve all influenced code, often at the expense of less effective bloat reduction and security improvement. In this research, we present RLDebloatDU, an innovative debloating technique that employs 1-DU chain minimality within abstract syntax trees. Our approach maintains essential program data dependencies, striking a balance between aggressive code reduction and the preservation of program semantics. We evaluated RLDebloatDU on ten Linux kernel programs, comparing its performance with two leading debloating techniques: Chisel, known for its aggressive debloating approach, and Razor, recognized for its conservative strategy. RLDebloatDU significantly lowers the incidence of Common Vulnerabilities and Exposures (CVEs) and improves soundness compared to both, highlighting its efficacy in reducing security issues without reintroducing previously resolved ones.
- Pythia: Compiler-Guided Defense Against Non-Control Data Attacks. Sharjeel Khan, Bodhisatwa Chatterjee, and Santosh Pande. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, La Jolla, CA, USA, 2024.
Modern C/C++ applications are susceptible to non-control data attacks, where an adversary attempts to exploit memory corruption vulnerabilities for security breaches such as privilege escalation, control-flow manipulation, etc. One popular class of non-control data attacks is control-flow bending, where the attacker manipulates program data to flip branch outcomes and divert the program control flow into alternative paths to gain privileges. Unfortunately, despite tremendous advancements in software security, state-of-the-art defense mechanisms such as Control-flow Integrity (CFI) are ineffective against control-flow bending attacks, especially those involving flipping of branch predicates. In this work, we propose a performance-aware defensive scheme, Pythia, which utilizes ARM’s pointer authentication (PA) hardware support for dynamic integrity checking of forward slices of vulnerable program variables that can be affected by input channels, and backward slices of branch variables (including pointers) that may be misused by the attacker. We first show that a naive scheme of protecting all vulnerabilities can suffer from an average runtime overhead of 47.88%. We then demonstrate how overheads can be minimized by selectively adding ARM PA-based canaries for statically allocated program variables and creating secure sections for dynamically allocated variables to avoid overflows by input channels. Our experiments show that employing this hybrid approach results in an average overhead of 13.07%. We empirically show that Pythia offers better security than the state-of-the-art data flow integrity (DFI) technique, especially in pointer-intensive code and C++ applications, with 92% of branches secured on average and 100% secured in the case of 3 applications.
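As a rough illustration of the kind of pointer-authentication canary such a scheme builds on, the hypothetical sketch below MACs a guarded buffer with AArch64's PACGA instruction and re-checks the MAC before the protected branch would execute. It requires an AArch64 core with FEAT_PAuth, and the names, layout, and keying are illustrative rather than Pythia's actual implementation:

```cpp
#include <cstdint>
#include <cstdlib>

// Hypothetical PA-based canary. A PACGA MAC of the buffer address is
// stored right after the buffer; an overflow from an input channel
// clobbers the stored MAC, which the re-check detects.
static inline uint64_t pac_sign(uint64_t value, uint64_t modifier) {
    uint64_t sig;
    asm volatile("pacga %0, %1, %2" : "=r"(sig) : "r"(value), "r"(modifier));
    return sig;
}

struct Guarded {
    char buf[64];     // statically allocated variable fed by input
    uint64_t canary;  // PA-signed canary placed right after it
};

int main() {
    Guarded g;
    uint64_t mod = 0x1234;  // per-run modifier (illustrative only)
    g.canary = pac_sign(reinterpret_cast<uint64_t>(g.buf), mod);
    // ... input channel writes into g.buf ...
    if (g.canary != pac_sign(reinterpret_cast<uint64_t>(g.buf), mod))
        std::abort();       // canary clobbered: likely a bending attempt
    return 0;
}
```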
2023
- Beacons: An End-to-End Compiler Framework for Predicting and Utilizing Dynamic Loop Characteristics. Girish Mururu, Sharjeel Khan, Bodhisatwa Chatterjee, Chao Chen, Chris Porter, and 2 more authors. Proc. ACM Program. Lang., Oct 2023.
Efficient management of shared resources is a critical problem in high-performance computing (HPC) environments. Existing workload management systems often promote non-sharing of resources among different co-executing applications to achieve performance isolation. Such schemes lead to poor resource utilization and suboptimal process throughput, adversely affecting user productivity. Tackling this problem in a scalable fashion is extremely challenging, since it requires the workload scheduler to possess in-depth knowledge of various application resource requirements and runtime phases at fine granularities within individual applications. In this work, we show that applications’ resource requirements and execution phase behaviour can be captured in a scalable and lightweight manner at runtime by estimating important program artifacts termed “dynamic loop characteristics”. Specifically, we propose a solution to the problem of efficient workload scheduling by designing a compiler and runtime cooperative framework that leverages novel loop-based compiler analysis for resource allocation. We present Beacons Framework, an end-to-end compiler and scheduling framework, that estimates dynamic loop characteristics, encapsulates them in compiler-instrumented beacons in an application, and broadcasts them during application runtime, for proactive workload scheduling. We focus on estimating four important loop characteristics: loop trip-count, loop timing, loop memory footprint, and loop data-reuse behaviour, through a combination of compiler analysis and machine learning. The novelty of the Beacons Framework also lies in its ability to tackle irregular loops that exhibit complex control flow with indeterminate loop bounds involving structure fields, aliased variables, and function calls, which are highly prevalent in modern workloads. At the back end, the Beacons Framework entails a proactive workload scheduler that leverages the runtime information to orchestrate aggressive process co-locations, for maximizing resource concurrency, without causing cache thrashing. Our results show that the Beacons Framework can predict different loop characteristics with an accuracy of 85% to 95% on average, and the proactive scheduler obtains an average throughput improvement of 1.9x (up to 3.2x) over state-of-the-art schedulers on an Amazon Graviton2 machine on consolidated workloads involving 1000-10000 co-executing processes, across 51 benchmarks.
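A hypothetical sketch of what the compiler-inserted instrumentation could look like; beacon_broadcast() and the beacon fields are invented stand-ins for the framework's real runtime API:

```cpp
#include <cstdio>

// Shape of a compiler-placed beacon: before a hot loop, the predicted
// loop characteristics are broadcast to the proactive scheduler.
struct LoopBeacon {
    long   trip_count;    // predicted iterations
    double est_time_ms;   // predicted loop timing
    long   footprint_kb;  // predicted memory footprint
    double reuse_density; // predicted data-reuse behaviour
};

// Stand-in for the runtime's IPC to the proactive workload scheduler.
void beacon_broadcast(const LoopBeacon& b) {
    std::printf("beacon: trips=%ld time=%.1fms fp=%ldKB reuse=%.2f\n",
                b.trip_count, b.est_time_ms, b.footprint_kb, b.reuse_density);
}

int main() {
    long n = 1 << 20;
    // Compiler-derived estimates (e.g., a trip-count expression on n):
    beacon_broadcast({n, 1.5, n * 8 / 1024, 0.7});
    volatile double sum = 0;
    for (long i = 0; i < n; ++i) sum = sum + i;  // the instrumented loop
    return 0;
}
```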
- Decker: Attack Surface Reduction via On-Demand Code Mapping. Chris Porter, Sharjeel Khan, and Santosh Pande. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, Vancouver, BC, Canada, Oct 2023.
Modern code reuse attacks take full advantage of bloated software. Attackers piece together short sequences of instructions in otherwise benign code to carry out malicious actions. Mitigating these reusable code snippets, known as gadgets, has become one of the prime focuses of attack surface reduction research. While some debloating techniques remove parts of software that contain such gadgets, other methods focus on making gadgets unusable by breaking up chains of them, thereby substantially diminishing the possibility of code reuse attacks. Third-party libraries are another main focus, because they exhibit a high number of vulnerabilities, but recently, techniques have emerged that deal with whole applications. Attack surface reduction efforts have typically tried to eliminate such attacks by subsetting (debloating) the application, e.g., via user-specified inputs, configurations, or features, to achieve high gadget reductions. However, such techniques suffer from limitations of soundness, i.e., the software might crash during no-attack executions on regular inputs, or they may be conservative and leave a large amount of attack surface untackled. In this work we present a general, whole-program attack surface reduction technique called Decker that significantly reduces the gadgets accessible to an attacker during an execution phase (called a deck) while incurring only minor performance degradation. Decker requires no user inputs and leaves all features intact. It uses static analysis to determine key function sets that should be enabled/disabled at runtime. The runtime system then enables these function sets at the specified program points during execution. On SPEC CPU 2017, our framework achieves 73.2% total gadget reduction with 5.2% average slowdown. On 10 GNU coreutils applications, it achieves 87.2% reduction and negligible slowdown. On the nginx server it achieves 80.3% reduction with 2% slowdown. We also provide a gadget chain-breaking case study, including detailed JOP gadget metrics on both Linux and Windows, and show that our framework breaks the shell-spawning chain in all cases.
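A minimal sketch of the on-demand mapping mechanism, assuming each deck's functions occupy their own page-aligned region; the deck layout and helper names are invented, and Decker's real runtime is driven by compiler-chosen program points:

```cpp
#include <sys/mman.h>
#include <cassert>
#include <cstddef>

// Each "deck" owns a page-aligned region holding its function bodies;
// the runtime flips the pages between inaccessible and executable at
// the program points chosen by the static analysis.
struct Deck { void* pages; size_t len; };

void deck_enable(const Deck& d) {   // entering an execution phase
    int rc = mprotect(d.pages, d.len, PROT_READ | PROT_EXEC);
    assert(rc == 0); (void)rc;
}

void deck_disable(const Deck& d) {  // leaving it: code (and gadgets) vanish
    int rc = mprotect(d.pages, d.len, PROT_NONE);
    assert(rc == 0); (void)rc;
}

int main() {
    size_t len = 4096;                       // one page for the demo
    void* p = mmap(nullptr, len, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    assert(p != MAP_FAILED);
    Deck d{p, len};
    deck_enable(d);    // would precede calls into this deck's functions
    deck_disable(d);   // attack surface reduced between phases
    munmap(p, len);
}
```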
- PinIt: Influencing OS Scheduling via Compiler-Induced Affinities. Girish Mururu, Kangqi Ni, Ada Gavrilovska, and Santosh Pande. In Proceedings of the 24th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems, Orlando, FL, USA, Oct 2023.
In multi-core machines, applications execute in a complex co-execution environment in which the number of concurrently executing applications typically exceeds the number of available cores. In order to fairly and efficiently utilize cores, modern operating systems (OS) such as Linux migrate threads between cores during execution. Although such thread migrations alleviate the problems of stalling and load balancing, yielding better core utilization, they also tend to destroy data locality, resulting in fewer cache hits, fewer TLB hits, and thus performance loss for the group of applications collectively. This problem is especially severe in embedded servers which execute media and vision applications that exhibit high data locality. On one hand, mitigating this problem across a group of applications with an OS-only solution is infeasible, since the OS treats applications as black boxes and has no knowledge of their locality and other behavior. On the other hand, to date, compiler optimizations have focused on analysis, transformations, and performance enhancement of applications in isolation, ignoring the problem of optimizing performance for applications as a group. This is because of the infeasibility of global compiler analysis across applications, as well as the dynamic nature of inter-application interactions, which is statically unknown. To address this problem, we propose PinIt, a compiler-directed methodology that analyzes applications individually yet induces the operating system to mediate actions across applications to minimize harmful migrations and maintain locality. PinIt determines the regions of a program in which the process should be pinned onto a core so that adverse migrations causing excessive cache and TLB misses are avoided. PinIt first calculates memory reuse density, a new measure that quantifies the reuses within code regions. Pin/unpin calls are then hoisted to the entries and exits of regions that exhibit high values of reuse density. The paper presents new analyses and transformations that optimize the placement of such calls. In an overloaded environment, compared to priority-cfs, PinIt speeds up high-priority applications in Mediabench workloads by 1.16x and 2.12x and in vision-based workloads by 1.35x and 1.23x on 8 cores and 16 cores, respectively, with almost the same or better throughput for low-priority applications.
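A sketch of the hoisted pin/unpin calls using the Linux affinity API; the helper names are illustrative, not PinIt's actual runtime interface:

```cpp
#include <sched.h>
#include <cstdio>

// Pin the process to its current core at the entry of a region with
// high memory-reuse density, and restore the original mask at exit.
static cpu_set_t saved_mask;

void pinit_pin() {                       // entry of high-reuse region
    sched_getaffinity(0, sizeof(saved_mask), &saved_mask);
    cpu_set_t one;
    CPU_ZERO(&one);
    CPU_SET(sched_getcpu(), &one);       // stay on the current core
    sched_setaffinity(0, sizeof(one), &one);
}

void pinit_unpin() {                     // region exit: OS may migrate again
    sched_setaffinity(0, sizeof(saved_mask), &saved_mask);
}

int main() {
    pinit_pin();
    volatile long acc = 0;               // cache/TLB-sensitive kernel
    for (long i = 0; i < (1 << 22); ++i) acc += i;
    pinit_unpin();
    std::printf("done on core %d\n", sched_getcpu());
}
```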
- Practical compilation of fexprs using partial evaluation: Fexprs can performantly replace macros in purely-functional Lisp. Nathan Braswell, Sharjeel Khan, and Santosh Pande. Oct 2023.
- Com-CAS: Effective Cache Apportioning under Compiler Guidance. Bodhisatwa Chatterjee, Sharjeel Khan, and Santosh Pande. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, Chicago, Illinois, Oct 2023.
With a growing number of cores in modern high-performance servers, effective sharing of the last level cache (LLC) is more critical than ever. The primary agenda of such systems is to maximize performance by efficiently supporting multi-tenancy of diverse workloads. However, this can be particularly challenging to achieve in practice, because modern workloads exhibit dynamic phase behaviour, which causes their cache requirements and sensitivities to vary at finer granularities during execution. Unfortunately, existing systems are oblivious to application phase behavior, and are unable to detect and react quickly enough to these rapidly changing cache requirements, often incurring significant performance degradation. In this paper, we propose Com-CAS, a new apportioning system that provides dynamic cache allocations for co-executing applications. Com-CAS differs from existing cache partitioning systems by adapting to the dynamic cache requirements of applications just-in-time, as opposed to reacting, without any hardware modifications. The front-end of Com-CAS consists of compiler analysis equipped with machine learning mechanisms to predict cache requirements, while the back-end consists of a proactive scheduler that dynamically apportions the LLC amongst co-executing applications leveraging Intel Cache Allocation Technology (CAT). Com-CAS’s partitioning scheme utilizes the compiler-generated information across finer granularities to predict rapidly changing dynamic application behaviors, while simultaneously maintaining data locality. Our experiments show that Com-CAS improves average weighted throughput by 15% over an unpartitioned cache system and outperforms the state-of-the-art partitioning system KPart [6] by 20%, while bounding the worst individual application completion-time degradation to meet various Service-Level Agreement (SLA) requirements.
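For context, LLC ways can be apportioned with Intel CAT through the Linux resctrl interface, roughly as sketched below. The group names and the way-mask split are invented, resctrl must be mounted and writable, and Com-CAS drives such allocations from compiler-predicted phase information rather than a static split:

```cpp
#include <fstream>
#include <string>

// Write a cache-way bitmask for one resctrl group (L3 cache domain 0).
void set_llc_mask(const std::string& group, unsigned mask) {
    std::ofstream f("/sys/fs/resctrl/" + group + "/schemata");
    f << "L3:0=" << std::hex << mask << "\n";  // ways granted to the group
}

int main() {
    // Illustrative split of an 8-way LLC between two tenant groups.
    set_llc_mask("latency_sensitive", 0xF0);   // ways 4..7
    set_llc_mask("batch",             0x0F);   // ways 0..3
}
```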
2022
- Near-Zero Downtime Recovery From Transient-Error-Induced Crashes. Chao Chen, Greg Eisenhauer, and Santosh Pande. IEEE Transactions on Parallel and Distributed Systems, Apr 2022.
Due to system scaling, transient errors caused by external noise, e.g., heat fluxes and particle strikes, have become a growing concern for current and upcoming exa-scale high-performance computing (HPC) systems. Applications running on these systems are expected to experience transient errors more frequently than ever before, which will either lead them to generate incorrect outputs or cause them to crash. However, since such errors are still quite rare as compared to no-fault cases, desirable solutions call for low/no-overhead systems that do not compromise performance under no-fault conditions and also allow very fast fault recovery to minimize downtime. In this article, we present IterPro, a light-weight compiler-assisted resilience technique to quickly and accurately recover processes from transient-error-induced crashes. During the compilation of applications, IterPro constructs a set of recovery kernels for crash-prone instructions. These recovery kernels are executed to repair the corrupted process states on-the-fly upon occurrences of errors, enabling applications to continue their executions instead of being terminated. When constructing recovery kernels, IterPro exploits side effects introduced by induction-variable-based code optimization techniques based on loop unrolling and strength reduction to improve its recovery capability. To this end, two new code transformation passes are introduced to expose the side effects for resilience purposes. We evaluated IterPro with 4 scientific workloads as well as the NPB benchmark suite. During their normal execution, IterPro incurs almost zero runtime overhead and a small, fixed 27MB memory overhead. Meanwhile, IterPro can recover on average 83.55 percent of crash-causing errors within dozens of milliseconds with negligible downtime. We also evaluated IterPro with parallel jobs running on 3072 cores and showed that IterPro can successfully mask the impact of crash-causing errors by providing almost uninterrupted execution. Finally, we present our preliminary evaluation results for BLAS, which show that IterPro is capable of recovering failures in libraries with a very high coverage rate of 83 percent and negligible overheads. With such an effective recovery mechanism, IterPro could tremendously mitigate the overheads and resource requirements of the resilience subsystem in future exa-scale systems.
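A conceptual sketch of a recovery kernel for a strength-reduced loop; the kernel is hand-written here for illustration, whereas IterPro derives such kernels automatically during compilation:

```cpp
#include <cstdint>
#include <cstdio>

// After strength reduction, p is an induction variable (p = base + i*8).
// If a transient fault corrupts p and the load would crash, a recovery
// kernel re-derives p from uncorrupted loop state instead of letting
// the process die.
int64_t recover_ptr(int64_t base, int64_t i) {
    return base + i * 8;   // recompute the strength-reduced pointer
}

int main() {
    int64_t a[4] = {1, 2, 3, 4};
    int64_t base = reinterpret_cast<int64_t>(a);
    int64_t i = 2;
    int64_t p = base + i * 8;        // the crash-prone address
    p ^= 0x4000000000000000LL;       // simulate a transient bit flip
    p = recover_ptr(base, i);        // what the recovery kernel would redo
    std::printf("recovered value: %lld\n",
                (long long)*reinterpret_cast<int64_t*>(p));
}
```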
- VICO: demand-driven verification for improving compiler optimizations. Sharjeel Khan, Bodhisatwa Chatterjee, and Santosh Pande. In Proceedings of the 36th ACM International Conference on Supercomputing, Virtual Event, Apr 2022.
In spite of tremendous advances in data dependence and dataflow analysis techniques, state-of-the-art optimizing compilers continue to suffer from imprecisions and miss potential optimization opportunities. These imprecisions result from statically unknown characteristics of variables that participate in the dependence systems, or aliases that affect key safety properties, which must be conservatively assumed. However, with the increased tractability of verification on modern systems, a demand-driven solution to this problem can be envisioned. In this work, we model loop optimization constraints as loop invariants, with the goal of proving their runtime behaviour under all inputs. Our proposed framework VICO first detects the unresolved constraints whose conservative assumption negatively affects specific compiler optimizations. These constraints are then modeled as potential invariants and are verified on a demand-driven basis. Finally, VICO incorporates the verified invariants in the analysis, which results in superior optimization. For this purpose, VICO converts conservative constraints identified by the LLVM compiler and the parallelization tool PLuTo into potential invariants that are then verified by the SMACK verification tool. Following such an approach enables us to target numerous optimizations at different compilation phases: automatic parallelization and loop transformations at the source level, and register allocation and global value numbering (GVN) at the IR level. Our results show that VICO improves the precision of dependence analysis by 45% in real-world cases, leading to superior optimization in over 75 loops in different scenarios such as mathematical simulations and solvers. The improvement in dependence precision led to an average speedup of 14.7x on Apple M1 Pro and 6.07x on Intel Xeon E5-2660 systems. In addition, VICO also enhances LLVM’s alias analysis, leading to improvements in LLVM backend optimizations, a 4% decrease in code size, and a 2.2% improvement in execution time in numerous Linux programs and SPEC benchmarks, with (mostly) low verification time.
- Compiler-assisted scheduling for multi-instance GPUs. Chris Porter, Chao Chen, and Santosh Pande. In Proceedings of the 14th Workshop on General Purpose Processing Using GPU, Seoul, Republic of Korea, Apr 2022.
NVIDIA’s Multi-Instance GPU (MIG) feature allows users to partition a GPU’s compute and memory into independent hardware instances. MIG guarantees full isolation among co-executing kernels on the device, which boosts security and prevents performance interference-related degradation. Despite the benefits of isolation, however, certain workloads do not necessarily need such guarantees, and in fact enforcing such isolation can negatively impact the throughput of a group of processes. In this work we aim to relax the isolation property for certain types of jobs, and to show how this can dramatically boost throughput across a mixed workload consisting of jobs that demand isolation and others that do not. The number of MIG partitions is hardware-limited but configurable, and state-of-the-art workload managers cannot safely take advantage of unused and wasted resources inside a given partition. We show how a compiler and runtime system working in tandem can be used to pack jobs into partitions when isolation is not necessary. Using this technique we improve overall utilization of the device while still reaping the benefits of MIG’s isolation properties. Our experimental results on NVIDIA A30s with a throughput-oriented workload show an average of 1.45x throughput improvement and 2.93x increase in GPU memory utilization over the Slurm workload manager. The presented framework is fully automatic and requires no changes to user code. Based on these results, we believe our scheme is a practical and strong advancement over state-of-the-art techniques currently employed for MIG.
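A toy version of the packing decision, with illustrative slice sizes and job descriptors: isolation-demanding jobs get an empty instance to themselves, while the rest may share an instance as long as their estimated footprints fit:

```cpp
#include <cstdio>
#include <string>
#include <vector>

struct Job      { std::string name; int mem_gb; bool needs_isolation; };
struct Instance { int mem_gb; int used = 0; bool exclusive = false; };

int main() {
    std::vector<Instance> slices(4, Instance{6});   // e.g. four 6 GB slices
    std::vector<Job> jobs = {{"train", 5, true},  {"infer1", 2, false},
                             {"infer2", 3, false}, {"infer3", 2, false}};
    for (const auto& j : jobs)
        for (auto& s : slices) {
            bool fits = s.used + j.mem_gb <= s.mem_gb;
            bool ok = j.needs_isolation ? s.used == 0   // exclusive slice
                                        : !s.exclusive; // shareable slice
            if (ok && fits) {
                s.used += j.mem_gb;
                s.exclusive = j.needs_isolation;
                std::printf("%s -> slice %ld\n", j.name.c_str(),
                            (long)(&s - slices.data()));
                break;
            }
        }
}
```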
- CASE: a compiler-assisted SchEduling framework for multi-GPU systems. Chao Chen, Chris Porter, and Santosh Pande. In Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Seoul, Republic of Korea, Apr 2022.
Modern computing platforms tend to deploy multiple GPUs on a single node to boost performance. GPUs have large computing capacities and are an expensive resource. Increasing their utilization without causing performance degradation of individual workloads is an important and challenging problem. Although services such as NVIDIA’s MPS allow multiple cooperative kernels to simultaneously run on a single device, they do not solve the co-execution problem for uncooperative, independent kernels on such a multi-GPU system. To tackle this problem, we propose CASE, a fully automated compiler-assisted scheduling framework. During the compilation of an application, CASE constructs GPU tasks from CUDA programs and instruments the code with a probe before each one. At runtime, each probe conveys information about its task’s resource requirements, such as memory and the number of streaming multiprocessors (SMs) needed, to a user-level scheduler. The scheduler then places each task onto a suitable device by employing a policy appropriate to the system. In our prototype, a throughput-oriented scheduling policy is implemented to evaluate our resource-aware scheduling framework. The Rodinia benchmark suite and the Darknet neural network framework were used in our evaluation. The results show that, as compared to existing state-of-the-art methods, CASE improves throughput by up to 2.5X for Rodinia and up to 2.7X for Darknet on modern NVIDIA GPU platforms, mainly due to the fact that it improves average system utilization by up to 3.36X and job turnaround time by up to 4.9X. Meanwhile, it limits individual kernel performance degradation to within 2.5%. CASE achieved peak system utilization of 78% for Rodinia and 80% for Darknet on a 4XV100 system.
- Effective Cache Apportioning for Performance Isolation Under Compiler Guidance. Bodhisatwa Chatterjee, Sharjeel Khan, and Santosh Pande. Apr 2022.
2021
- Not so fast: understanding and mitigating negative impacts of compiler optimizations on code reuse gadget sets. Michael D. Brown, Matthew Pruett, Robert Bigelow, Girish Mururu, and Santosh Pande. Proc. ACM Program. Lang., Oct 2021.
Despite extensive testing and correctness certification of their functional semantics, a number of compiler optimizations have been shown to violate security guarantees implemented in source code. While prior work has shed light on how such optimizations may introduce semantic security weaknesses into programs, there remains a significant knowledge gap concerning the impacts of compiler optimizations on non-semantic properties with security implications. In particular, little is currently known about how code generation and optimization decisions made by the compiler affect the availability and utility of reusable code segments, called gadgets, required for implementing code reuse attack methods such as return-oriented programming. In this paper, we bridge this gap through a study of the impacts of compiler optimization on code reuse gadget sets. We analyze and compare 1,187 variants of 20 different benchmark programs built with two production compilers (GCC and Clang) to determine how their optimization behaviors affect the code reuse gadget sets present in program variants with respect to both quantitative and qualitative metrics. Our study exposes an important and unexpected problem: compiler optimizations introduce new gadgets at a high rate and produce code containing gadget sets that are generally more useful to an attacker than those in unoptimized code. Using differential binary analysis, we identify several undesirable behaviors at the root of this phenomenon. In turn, we propose and evaluate several strategies to mitigate these behaviors. In particular, we show that post-production binary recompilation can effectively mitigate these behaviors with negligible performance impacts, resulting in optimized code with significantly smaller and less useful gadget sets.
- Distributed Work Stealing at Scale via Matchmaking. Hrushit Parikh, Vinit Deodhar, Ada Gavrilovska, and Santosh Pande. In 2021 IEEE International Conference on Cluster Computing (CLUSTER), Sep 2021.
Many classes of high-performance applications and combinatorial problems exhibit a large degree of imbalance in terms of task execution times. One approach to achieving high machine efficiency and balanced resource use is to over-decompose the problem into fine-grained tasks that are then dynamically re-balanced across the system by approaches such as work stealing. For such irregular applications, existing work stealing or work balancing techniques exhibit high overheads due to potentially excessive communication messages, and/or delays experienced by idle nodes in finding work due to repeated failed steals. In particular, on very large scale clusters, the problem of matching work generation and work consumption is extremely challenging. At the heart of the problem is a dilemma: idle workers do not know where to look for work, and work producers do not know where to send the generated extra work. We contend that the fundamental problem of distributed work stealing is not one of rapid dissemination of work availability but one of rapidly bringing together work producers and consumers. In response, we develop a work-stealing-based algorithm that performs timely, lightweight, and highly efficient matchmaking between work producers and consumers. The matchmaker simultaneously accumulates work-availability messages from producer nodes and rapidly redirects them to consumers. The matchmaker is itself a regular worker that executes tasks from its own work queue when it pauses from matchmaking. We validate these claims via an implementation of a matchmaking-based scheduler for Charm++, which we evaluate using representative benchmarks running on up to 8K cores. Results show that our scheduler is able to outperform other load balancers and distributed work stealing schedulers, and to achieve scale beyond what is possible with current approaches.
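A skeleton of the matchmaking idea, with in-process queues standing in for the network messages of the distributed implementation:

```cpp
#include <deque>
#include <cstdio>

// Instead of idle workers probing random victims, producers post work
// availability and consumers post steal requests to a matchmaker,
// which pairs them as soon as both sides are present.
struct Matchmaker {
    std::deque<int> work_avail;  // node ids with surplus tasks
    std::deque<int> idle;        // node ids requesting work

    void post_work(int node) { work_avail.push_back(node); pair(); }
    void post_idle(int node) { idle.push_back(node);       pair(); }

    void pair() {
        while (!work_avail.empty() && !idle.empty()) {
            int src = work_avail.front(); work_avail.pop_front();
            int dst = idle.front();       idle.pop_front();
            std::printf("redirect: node %d steals from node %d\n", dst, src);
        }
    }
};

int main() {
    Matchmaker m;
    m.post_idle(7);   // an idle node arrives first
    m.post_work(2);   // a producer shows up: matched immediately
    m.post_work(5);   // queued until a consumer appears
    m.post_idle(3);
}
```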
- Near-zero Downtime Recovery from Transient-error-induced Crashes. Chao Chen, Greg Eisenhauer, and Santosh Pande. Sep 2021.
- Compiler-Guided Throughput Scheduling for Many-core Machines. Girish Mururu, Sharjeel Khan, Bodhisatwa Chatterjee, Chao Chen, Chris Porter, and 2 more authors. Sep 2021.
- Effective GPU Sharing Under Compiler Guidance. Chao Chen, Chris Porter, and Santosh Pande. Sep 2021.
- On-the-fly Code Activation for Attack Surface Reduction. Chris Porter, Sharjeel Khan, and Santosh Pande. Sep 2021.
2020
- Generating Robust Parallel Programs via Model Driven Prediction of Compiler Optimizations for Non-determinism. Girish Mururu, Kaushik Ravichandran, Ada Gavrilovska, and Santosh Pande. In Proceedings of the 49th International Conference on Parallel Processing, Edmonton, AB, Canada, Oct 2020.
Execution orders in parallel programs are governed by non-determinism and can vary substantially across different executions even on the same input. Thus, a highly non-deterministic program can exhibit rare execution orders never observed during testing. It is desirable to reduce non-determinism to suppress corner-case behavior in the production cycle (making the execution robust or bug-free) and to increase non-determinism for reproducing bugs in the development cycle. Performance-wise, different optimization levels (e.g., from O0 to O3) are enabled during development; however, non-determinism-wise, developers have no way to select the right compiler optimization level in order to increase non-determinism for debugging or to decrease it for robustness. The major source of non-determinism is the underlying execution model, primarily determined by the processor architecture and the operating system (OS). Architectural artifacts such as cache misses and TLB misses characterize and shape the non-determinism. In this work, we seek to capture such sources of non-determinism through an architectural model based on hardware performance counters and use the model to predict the appropriate compiler optimization level for generating a robust parallel program, which has minimal non-determinism in production. As a side effect, the generated model also allows maximizing non-determinism for debugging purposes. We demonstrate our technique on the PARSEC benchmark suite, and among other results show that the generated robust program decreases non-deterministic behavior by up to 66.48%; as a practical measure, we also show that a known race condition plus randomly injected ones are rendered benign in the robust parallel program generated by our framework.
- BlankIt library debloating: getting what you want instead of cutting what you don’t. Chris Porter, Girish Mururu, Prithayan Barua, and Santosh Pande. In Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation, London, UK, Oct 2020.
Modern software systems make extensive use of libraries derived from C and C++. Because of the lack of memory safety in these languages, however, the libraries may suffer from vulnerabilities, which can expose the applications to potential attacks. For example, a very large number of return-oriented programming gadgets exist in glibc that allow stitching together semantically valid but malicious Turing-complete and -incomplete programs. While CVEs get discovered and often patched and remedied, such gadgets serve as building blocks of future undiscovered attacks, opening an ever-growing set of possibilities for generating malicious programs. Thus, significant reduction in the quantity and expressiveness (utility) of such gadgets for libraries is an important problem. In this work, we propose a new approach for handling an application’s library functions that focuses on the principle of “getting only what you want.” This is a significant departure from the current approaches that focus on “cutting what is unwanted.” Our approach focuses on activating/deactivating library functions on demand in order to reduce the dynamically linked code surface, so that the possibilities of constructing malicious programs diminish substantially. The key idea is to load only the set of library functions that will be used at each library call site within the application at runtime. This approach of demand-driven loading relies on an input-aware oracle that predicts a near-exact set of library functions needed at a given call site during the execution. The predicted functions are loaded just in time and unloaded on return. We present a decision-tree based predictor, which acts as an oracle, and an optimized runtime system, which works directly with library binaries like GNU libc and libstdc++. We show that on average, the proposed scheme cuts the exposed code surface of libraries by 97.2%, reduces ROP gadgets present in linked libraries by 97.9%, achieves a prediction accuracy of at least 97% in most cases, and adds a runtime overhead of 18% on all libraries (16% for glibc, 2% for others) across all benchmarks of SPEC 2006. Further, we demonstrate BlankIt on two real-world applications, sshd and nginx, with a high amount of debloating and low overheads.
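A sketch of demand-driven loading at a single call site; using dlopen/dlclose on a whole library here is a simplification, since BlankIt activates and deactivates individual functions inside an already-linked library under the guidance of its decision-tree predictor:

```cpp
#include <dlfcn.h>
#include <cassert>
#include <cstdio>

// "Get only what you want": resolve the predicted function just in
// time at the call site and drop it again on return, so the exposed
// code surface between calls stays minimal. (Linux; link with -ldl
// on older glibc.)
int main() {
    void* h = dlopen("libm.so.6", RTLD_NOW);      // load on demand
    assert(h);
    auto f = (double (*)(double))dlsym(h, "cos"); // the predicted function
    assert(f);
    std::printf("cos(0) = %f\n", f(0.0));
    dlclose(h);                                   // unload on return
}
```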
- A Compiler Assisted Scheduler for Detecting and Mitigating Cache-Based Side Channel Attacks. Sharjeel Khan, Girish Mururu, and Santosh Pande. Oct 2020.
2019
- Quantifying and Reducing Execution Variance in STM via Model Driven Commit Optimization. Girish Mururu, Ada Gavrilovska, and Santosh Pande. In 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), Feb 2019.
Simplified parallel programming coupled with an ability to express speculative computation is realized with Software Transactional Memory (STM). Although STMs are gaining popularity because of significant improvements in parallel performance, they exhibit enormous variation in transaction execution with non-repeatable performance behavior, which is unacceptable in many application domains, especially those in which frame rates and responsiveness should be predictable. In other domains, reproducible transactional behavior helps towards system provisioning. Thus, reducing execution variance in STM is an important performance goal that has been mostly overlooked. In this work, we minimize the variance in execution time of threads in STM by reducing the non-determinism exhibited due to speculation. We define the state of STM, and we use it to first quantify non-determinism and then generate an automaton that models the execution behavior of threads in STM. We finally use the state automaton to guide the STM away from non-predictable transactional behavior, reducing non-determinism in roll-backs, which in turn results in a reduction in variance. We observed an average reduction in the variance of thread execution times of up to 74% on 16 cores and 53% on 8 cores, by reducing non-determinism by up to 24% on 16 cores and 44% on 8 cores, respectively, on the STAMP benchmark suite, while experiencing average slowdowns of 4.8% on 8 cores and 19.2% on 16 cores. We also reduced the variance in frame rate by a maximum of 65% on a version of the real-world game Quake3 without degradation in timing.
- Characterizing Dominant Program Behavior Using the Execution-Time Variance of the Call Structure. Tushar Kumar, Kangqi Ni, and Santosh Pande. In 2019 IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), Apr 2019.
Traditional profiling techniques typically identify performance hot-spots. Other specialized techniques such as WCET analysis cater to safety-critical real-time systems with hard constraints and thus make conservative assumptions. However, several domains motivate the need to systematically characterize and represent differential timing properties, such as the variance in execution time of application artifacts such as functions. Such domains include vulnerability analysis of applications with regard to differential timing attacks, and the optimization of soft real-time applications to reduce frame-rate fluctuations. In this paper, we motivate the need for execution variance as a performance measure and propose a variance-based analysis scheme. We introduce a new program representation called the Variance Characterization Graph (VCG) that is used both as the intermediate representation for the variance-based analysis, and as the final representation that provides concise actionable information to programmers and optimization frameworks. We develop a methodology based on statistical pattern matching to summarize the dominant patterns of application behavior into a very compact VCG representation useful for tuning application behavior such as soft-real-time properties.
- CARE: compiler-assisted recovery from soft failures. Chao Chen, Greg Eisenhauer, Santosh Pande, and Qiang Guan. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Denver, Colorado, Apr 2019.
As processors continue to boost system performance with higher circuit density, shrinking process technology, and near-threshold voltage (NTV) operations, they are projected to be more vulnerable to transient faults, which have become one of the major concerns for future extreme-scale HPC systems. Despite being relatively infrequent, crashes due to transient faults are incredibly disruptive, particularly for massively parallel jobs on supercomputers, where they potentially kill the entire job, requiring an expensive rerun or restart from a checkpoint. In this paper, we present CARE, a light-weight compiler-assisted technique to repair the (crashed) process on-the-fly when a crash-causing error is detected, allowing applications to continue their executions instead of being simply terminated and restarted. Specifically, CARE seeks to repair failures that would result in application crashes due to invalid memory references (segmentation violations). During the compilation of applications, CARE constructs a recovery kernel for each crash-prone instruction, and upon an occurrence of an error, CARE attempts to repair the corrupted state of the process by executing the constructed recovery kernel to recompute the memory reference on-the-fly. We evaluated CARE with four scientific workloads. During their normal execution, CARE incurs almost zero runtime overhead and a fixed 27MB memory overhead. Meanwhile, CARE can recover on average 83.54% of crash-causing errors within dozens of milliseconds. We also evaluated CARE with parallel jobs running on 3072 cores and showed that CARE can successfully mask the impact of crash-causing errors by providing almost uninterrupted execution. Finally, we present our preliminary evaluation results for BLAS, which show that CARE is capable of recovering failures in libraries with a very high coverage rate of 83% and negligible overheads. With such an effective recovery mechanism, CARE could tremendously mitigate the overheads and resource requirements of the resilience subsystem in future extreme-scale systems.
- Is Less Really More? Towards Better Metrics for Measuring Security Improvements Realized Through Software Debloating. Michael D. Brown and Santosh Pande. In 12th USENIX Workshop on Cyber Security Experimentation and Test (CSET 19), Aug 2019.
- Binary Debloating for Security via Demand Driven Loading. Girish Mururu, Chris Porter, Prithayan Barua, and Santosh Pande. Aug 2019.
- CARVE: Practical Security-Focused Software Debloating Using Simple Feature Set Mappings. Michael D. Brown and Santosh Pande. In Proceedings of the 3rd ACM Workshop on Forming an Ecosystem Around Software Transformation, Nov 2019.
2018
- LADR: low-cost application-level detector for reducing silent output corruptions. Chao Chen, Greg Eisenhauer, Matthew Wolf, and Santosh Pande. In Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing, Tempe, Arizona, Nov 2018.
Applications running on future high performance computing (HPC) systems are more likely to experience transient faults due to technology scaling trends with respect to higher circuit density, smaller transistor size, and near-threshold voltage (NTV) operations. A transient fault could corrupt application state without warning, possibly leading to incorrect application output. Such errors are called silent data corruptions (SDCs). In this paper, we present LADR, a low-cost application-level SDC detector for scientific applications. LADR protects scientific applications from SDCs by watching for data anomalies in their state variables (those of scientific interest). It employs compile-time data-flow analysis to minimize the number of monitored variables, thereby reducing runtime and memory overheads while maintaining a high level of fault coverage with low false positive rates. We evaluated LADR with 4 scientific workloads and results show that LADR achieved > 80% fault coverage with only ∼3% runtime overheads and ∼1% memory overheads. As compared to prior state-of-the-art anomaly-based detection methods, LADR achieved comparable or improved fault coverage, but reduced runtime overheads by 21%∼75% and memory overheads by 35%∼55% for the evaluated workloads. We believe that such an approach, with low memory and runtime overheads coupled with attractive detection precision, makes LADR a viable approach for assuring correct output from large-scale high performance simulations.
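A toy version of the anomaly test applied to a monitored state variable; the threshold is illustrative, and LADR selects which variables to watch via compile-time data-flow analysis rather than checking everything:

```cpp
#include <cmath>
#include <cstdio>

// Flag a possible silent data corruption when a state variable jumps
// far outside the range suggested by its previous value.
bool anomalous(double prev, double next, double tol = 0.10) {
    return std::fabs(next - prev) > tol * std::fabs(prev) + 1e-12;
}

int main() {
    // A smoothly evolving simulation variable with one injected flip;
    // both the jump and the return to normal get flagged.
    double temp[] = {300.1, 300.3, 300.4, 512.7 /* bit flip */, 300.5};
    for (int t = 1; t < 5; ++t)
        if (anomalous(temp[t - 1], temp[t]))
            std::printf("possible SDC at step %d: %.1f -> %.1f\n",
                        t, temp[t - 1], temp[t]);
}
```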
- Quantifying and reducing execution variance in STM via model driven commit optimization. Girish Mururu, Ada Gavrilovska, and Santosh Pande. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Vienna, Austria, Nov 2018.
Simplified parallel programming coupled with an ability to express speculative computation is realized with Software Transactional Memory (STM). Although STMs are gaining popularity because of significant improvements in parallel performance, they exhibit enormous variation in transaction execution with non-repeatable performance behavior, which is unacceptable in many application domains, especially those in which frame rates and responsiveness should be predictable. Thus, reducing execution variance in STM is an important performance goal that has been mostly overlooked. In this work, we minimize the variance in execution time of threads in STM by reducing the non-determinism exhibited due to speculation: we first quantify non-determinism and generate an automaton that models the behavior of the STM. We used the automaton to guide the STM to a less non-deterministic execution that reduced the variance in frame rate by a maximum of 65% on a version of the real-world Quake3 game.
2016
- Efficient distributed workstealing via matchmaking. Hrushit Parikh, Vinit Deodhar, Ada Gavrilovska, and Santosh Pande. In Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Barcelona, Spain, Nov 2016.
Many classes of high-performance applications and combinatorial problems exhibit a large degree of runtime load variability. One approach to achieving balanced resource use is to over-decompose the problem into fine-grained tasks that are then dynamically balanced using approaches such as workstealing. Existing work stealing techniques for such irregular applications, running on large clusters, exhibit high overheads due to potentially untimely interruption of busy nodes, excessive communication messages, and delays experienced by idle nodes in finding work due to repeated failed steals. We contend that the fundamental problem of distributed work-stealing is one of rapidly bringing together work producers and consumers. In response, we develop an algorithm that performs timely, lightweight, and highly efficient matchmaking between work producers and consumers, which results in accurate load balance. Experimental evaluations show that our scheduler is able to outperform other distributed work stealing schedulers, and to achieve scale beyond what is possible with current approaches.
2015
- Compiler Assisted Load Balancing on Large Clusters. Vinit Deodhar, Hrushit Parikh, Ada Gavrilovska, and Santosh Pande. In 2015 International Conference on Parallel Architecture and Compilation (PACT), Oct 2015.
Load balancing of tasks across processing nodes is critical for achieving speedup on large-scale clusters. Load balancing schemes typically detect an imbalance and then migrate load from an overloaded processing node to an idle or lightly loaded one; thus, estimation of load critically affects the performance of load balancing schemes. On large-scale clusters, the latency of load migration between processing nodes (and the energy) is also a significant overhead, and any missteps in load estimation can cause significant migrations and performance losses. Currently, load estimation is done either by profile- or feedback-driven approaches in sophisticated systems such as Charm++, but such approaches must be re-thought in light of workloads such as Adaptive Mesh Refinement (AMR) and multiscale physics, where the load variations can be quite dynamic and rapid. In this work we propose a compiler-based framework which performs precise prediction of the forthcoming workload. The compiler-driven load prediction technique performs static analysis of a task, derives an expression to predict the load of a task, and hoists it as early as possible in the control flow of execution. The compiler also inserts corrector expressions at strategic program points which refine the reachability probability of the load as well as its estimation. The predictor and corrector expressions are evaluated at runtime, and the predicted load information is refined as execution proceeds and is eventually used by the load balancer to make efficient migration decisions. We present an implementation of the above in the Rose compiler and the Charm++ parallel programming framework. We demonstrate the effectiveness of the framework on key benchmarks that exhibit dynamic variations and show how the compiler framework assists load balancing schemes in Charm++ to provide significant gains.
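A sketch of the predictor/corrector idea; send_load_estimate() stands in for the message to the Charm++ load balancer, all names are illustrative, and the cost expression n*n is the kind the compiler would derive from the loop bounds:

```cpp
#include <cstdio>

// Stand-in for the runtime message that feeds the load balancer.
void send_load_estimate(int task, long units, double prob) {
    std::printf("task %d: ~%ld work units (reachability %.2f)\n",
                task, units, prob);
}

long task_work(int n, bool refine) {
    // Predictor: hoisted as early as n is known; the n*n cost
    // expression is derived statically from the doubly nested loop.
    send_load_estimate(0, (long)n * n, 0.5);
    if (refine)                        // corrector: branch now resolved,
        send_load_estimate(0, (long)n * n, 1.0); // reachability refined
    long acc = 0;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) acc += i ^ j;
    return acc;
}

int main() { std::printf("acc=%ld\n", task_work(512, true)); }
```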
- PiMiCo: Privacy Preservation via Migration in Collaborative Mobile Clouds. Kaushik Ravichandran, Ada Gavrilovska, and Santosh Pande. In 2015 48th Hawaii International Conference on System Sciences, Jan 2015.
The proliferation of mobile devices and mobile clouds, coupled with a multitude of their sensing abilities, is creating interesting possibilities: the sensing capabilities generate different types and fidelities of data in a geographically distributed manner that can be used to build new kinds of peer-to-peer applications. However, the data generated by these mobile devices can be personal and of a highly confidential nature. While very interesting possibilities exist for collaborating on the diverse, shared data in real time, privacy policies on data sharing, transport, and usage must be clearly specified and respected. The goal of this work is to introduce a privacy-preserving, data-centric programming model for building collaborative applications in large-scale mobile clouds and to discuss its design. Our work introduces several concepts and leverages privacy annotations and a transparent execution migration framework to achieve our goals. We also present an evaluation using several applications demonstrating that overheads are minimal and the system can be used in a real-time setting.