## Posts

Apr, 5

### LS-CAT: A Large-Scale CUDA AutoTuning Dataset

The effectiveness of Machine Learning (ML) methods depend on access to large suitable datasets. In this article, we present how we build the LS-CAT (Large-Scale CUDA AutoTuning) dataset sourced from GitHub for the purpose of training NLP-based ML models. Our dataset includes 19 683 CUDA kernels focused on linear algebra. In addition to the CUDA […]

Apr, 5

### Energy-aware Task Scheduling with Deadline Constraint in DVFS-enabled Heterogeneous Clusters

Energy conservation of large data centers for high-performance computing workloads, such as deep learning with big data, is of critical significance, where cutting down a few percent of electricity translates into million-dollar savings. This work studies energy conservation on emerging CPU-GPU hybrid clusters through dynamic voltage and frequency scaling (DVFS). We aim at minimizing the […]

Mar, 28

### Enabling OpenMP Task Parallelism on Multi-FPGAs

FPGA-based hardware accelerators have received increasing attention mainly due to their ability to accelerate deep pipelined applications, thus resulting in higher computational performance and energy efficiency. Nevertheless, the amount of resources available on even the most powerful FPGA is still not enough to speed up very large modern workloads. To achieve that, FPGAs need to […]

Mar, 28

### Accelerating Deep Neural Networks implementation: A survey

Recently, Deep Learning (DL) applications are getting more and more involved in different fields. Deploying such Deep Neural Networks (DNN) on embedded devices is still a challenging task considering the massive requirement of computation and storage. Given that the number of operations and parameters increases with the complexity of the model architecture, the performance will […]

Mar, 28

### Accelerating Sparse Approximate Matrix Multiplication on GPUs

Although the matrix multiplication plays a vital role in computational linear algebra, there are few efficient solutions for matrix multiplication of the near-sparse matrices. The Sparse Approximate Matrix Multiply (SpAMM) is one of the algorithms to fill the performance gap neglected by traditional optimizations for dense/sparse matrix multiplication. However, existing SpAMM algorithms fail to exploit […]

Mar, 28

### De-specializing an HLS library for Deep Neural Networks: improvements upon hls4ml

Custom hardware accelerators for Deep Neural Networks are increasingly popular: in fact, the flexibility and performance offered by FPGAs are well-suited to the computational effort and low latency constraints required by many image recognition and natural language processing tasks. The gap between high-level Machine Learning frameworks (e.g., Tensorflow, Pytorch) and low-level hardware design in Verilog/VHDL […]

Mar, 28

### CUDA Tutorial – Cryptanalysis of Classical Ciphers Using Modern GPUs and CUDA

CUDA (formerly an abbreviation of Compute Unified Device Architecture) is a parallel computing platform and API model created by Nvidia allowing software developers to use a CUDA-enabled graphics processing unit (GPU) for general purpose processing. Throughout this tutorial, we introduce the CUDA concepts in an easy-to-grasp interactive way. Starting from scratch, we implement a complete […]

Mar, 21

### Porting a sparse linear algebra math library to Intel GPUs

With the announcement that the Aurora Supercomputer will be composed of general purpose Intel CPUs complemented by discrete high performance Intel GPUs, and the deployment of the oneAPI ecosystem, Intel has committed to enter the arena of discrete high performance GPUs. A central requirement for the scientific computing community is the availability of production-ready software […]

Mar, 21

### GPGPU Task Scheduling Technique for Reducing the Performance Deviation of Multiple GPGPU Tasks in RPC-Based GPU Virtualization Environments

In remote procedure call (RPC)-based graphic processing unit (GPU) virtualization environments, GPU tasks requested by multiple-user virtual machines (VMs) are delivered to the VM owning the GPU and are processed in a multi-process form. However, because the thread executing the computing on general GPUs cannot arbitrarily stop the task or trigger context switching, GPU monopoly […]

Mar, 21

### Physics and Computing Performance of the Exa.TrkX TrackML Pipeline

The Exa.TrkX project has applied geometric learning concepts such as metric learning and graph neural networks to HEP particle tracking. The Exa.TrkX tracking pipeline clusters detector measurements to form track candidates and filters them. The pipeline, originally developed using the TrackML dataset (a simulation of an LHC-like tracking detector), has been demonstrated on various detectors, […]

Mar, 21

### AVEC: Accelerator Virtualization in Cloud-Edge Computing for Deep Learning Libraries

Edge computing offers the distinct advantage of harnessing compute capabilities on resources located at the edge of the network to run workloads of relatively weak user devices. This is achieved by offloading computationally intensive workloads, such as deep learning from user devices to the edge. Using the edge reduces the overall communication latency of applications […]

Mar, 21

### A Deep Learning Based Cost Model for Automatic Code Optimization

Enabling compilers to automatically optimize code has been a longstanding goal for the compiler community. Efficiently solving this problem requires using precise cost models. These models predict whether applying a sequence of code transformations reduces the execution time of the program. Building an analytical cost model to do so is hard in modern x86 architectures […]