Posts
Mar, 21
Porting a sparse linear algebra math library to Intel GPUs
With the announcement that the Aurora Supercomputer will be composed of general purpose Intel CPUs complemented by discrete high performance Intel GPUs, and the deployment of the oneAPI ecosystem, Intel has committed to enter the arena of discrete high performance GPUs. A central requirement for the scientific computing community is the availability of production-ready software […]
Mar, 21
GPGPU Task Scheduling Technique for Reducing the Performance Deviation of Multiple GPGPU Tasks in RPC-Based GPU Virtualization Environments
In remote procedure call (RPC)-based graphic processing unit (GPU) virtualization environments, GPU tasks requested by multiple-user virtual machines (VMs) are delivered to the VM owning the GPU and are processed in a multi-process form. However, because the thread executing the computing on general GPUs cannot arbitrarily stop the task or trigger context switching, GPU monopoly […]
Mar, 21
Physics and Computing Performance of the Exa.TrkX TrackML Pipeline
The Exa.TrkX project has applied geometric learning concepts such as metric learning and graph neural networks to HEP particle tracking. The Exa.TrkX tracking pipeline clusters detector measurements to form track candidates and filters them. The pipeline, originally developed using the TrackML dataset (a simulation of an LHC-like tracking detector), has been demonstrated on various detectors, […]
Mar, 21
AVEC: Accelerator Virtualization in Cloud-Edge Computing for Deep Learning Libraries
Edge computing offers the distinct advantage of harnessing compute capabilities on resources located at the edge of the network to run workloads of relatively weak user devices. This is achieved by offloading computationally intensive workloads, such as deep learning from user devices to the edge. Using the edge reduces the overall communication latency of applications […]
Mar, 21
A Deep Learning Based Cost Model for Automatic Code Optimization
Enabling compilers to automatically optimize code has been a longstanding goal for the compiler community. Efficiently solving this problem requires using precise cost models. These models predict whether applying a sequence of code transformations reduces the execution time of the program. Building an analytical cost model to do so is hard in modern x86 architectures […]
Mar, 19
Reviewing GPU architectures to build efficient back projection for parallel geometries
Back-Projection is the major algorithm in Computed Tomography to reconstruct images from a set of recorded projections. It is used for both fast analytical methods and high-quality iterative techniques. X-ray imaging facilities rely on Back-Projection to reconstruct internal structures in material samples and living organisms with high spatial and temporal resolution. Fast image reconstruction is […]
Mar, 14
DISC: A Dynamic Shape Compiler for Machine Learning Workloads
Many recent machine learning models show dynamic shape characteristics. However, existing AI compiler optimization systems suffer a lot from problems brought by dynamic shape models, including compilation overhead, memory usage, optimization pipeline and deployment complexity. This paper provides a compiler system to natively support optimization for dynamic shape workloads, named DISC. DISC enriches a set […]
Mar, 14
Transparent Checkpointing for OpenGL Applications on GPUs
This work presents transparent checkpointing of OpenGL applications, refining the split-process technique[1] for application in GPU-based 3D graphics. The split-process technique was earlier applied to checkpointing MPI and CUDA programs, enabling reinitialization of driver libraries. The presented design targets practical, checkpoint-package agnostic checkpointing of OpenGL applications. An early prototype is demonstrated on Autodesk Maya. Maya […]
Mar, 14
A Compiler Infrastructure for Accelerator Generators
We present Calyx, a new intermediate language (IL) for compiling high-level programs into hardware designs. Calyx combines a hardware-like structural language with a software-like control flow representation with loops and conditionals. This split representation enables a new class of hardware-focused optimizations that require both structural and control flow information which are crucial for high-level programming […]
Mar, 14
ECM modeling and performance tuning of SpMV and Lattice QCD on A64FX
The A64FX CPU is arguably the most powerful Arm-based processor design to date. Although it is a traditional cache-based multicore processor, its peak performance and memory bandwidth rival accelerator devices. A good understanding of its performance features is of paramount importance for developers who wish to leverage its full potential. We present an architectural analysis […]
Mar, 14
hls4ml: An Open-Source Codesign Workflow to Empower Scientific Low-Power Machine Learning Devices
Accessible machine learning algorithms, software, and diagnostic tools for energy-efficient devices and systems are extremely valuable across a broad range of application domains. In scientific domains, real-time near-sensor processing can drastically improve experimental design and accelerate scientific discoveries. To support domain scientists, we have developed hls4ml, an open-source software-hardware codesign workflow to interpret and translate […]
Mar, 7
LazyTensor: combining eager execution with domain-specific compilers
Domain-specific optimizing compilers have demonstrated significant performance and portability benefits, but require programs to be represented in their specialized IRs. Existing frontends to these compilers suffer from the "language subset problem" where some host language features are unsupported in the subset of the user’s program that interacts with the domain-specific compiler. By contrast, define-by-run ML […]