Posts
May, 3
Tools for GPU Computing – Debugging and Performance Analysis of Heterogenous HPC Applications
General-purpose GPUs are now ubiquitous in high-end supercomputing. All but one (the Japanese Fugaku system, which is based on ARM processors) of the announced (pre-)exascale systems contain large numbers of GPUs that deliver the majority of the performance of these systems. Thus, GPU programming will be a necessity for application developers using high-end HPC […]
May, 3
AutoParBench: A Unified Test Framework for OpenMP-based Parallelizers
This paper describes AutoParBench, a framework to test OpenMP-based automatic parallelization tools. The core idea of this framework is a common representation, called a "JSON snapshot", that normalizes the output produced by auto-parallelizers. By converting—automatically—this output to the common representation, AutoParBench lets us compare auto-parallelizers among themselves, and compare them semantically against a reference collection. […]
May, 3
Leveraging Data-Flow Information for Efficient Scheduling of Task-Parallel Programs on Heterogeneous Systems
Writing efficient programs for heterogeneous platforms is challenging: programmers must deal with multiple programming models, partition work for CPUs and accelerators with different compute capabilities, requiring different amounts of parallelism, and manage memory in multiple distinct address spaces. Consequently, programming models which only require expressing parallelism and data dependences can not only unburden the programmer […]
May, 3
Tools for Reduced Precision Computation: A Survey
The use of reduced precision to improve performance metrics such as computation latency and power consumption is a common practice in the embedded systems field. This practice is emerging as a new trend in High Performance Computing (HPC), especially when new error-tolerant applications are considered. However, standard compiler frameworks do not support automated precision customization, […]
May, 3
86 PFLOPS Deep Potential Molecular Dynamics simulation of 100 million atoms with ab initio accuracy
We present the GPU version of DeePMD-kit, which, upon training a deep neural network model using ab initio data, can drive extremely large-scale molecular dynamics (MD) simulation with ab initio accuracy. Our tests show that the GPU version is 7 times faster than the CPU version with the same power consumption. The code can scale […]
May, 2
cuda-kat: The CUDA Kernel Author’s Toolkit
An install-less, header-only library which is a loosely-coupled collection of utility functions and classes for writing device-side CUDA code (kernels and non-kernel functions). These let us:
* Write templated device-side code without constantly coming up against not-trivially-templatable bits.
* Use standard-library(-like) containers in device-side code (but not have to use them).
* Not repeat ourselves as […]
Apr, 26
Automatic Parallelization for Heterogeneous Embedded Systems
Recent years have seen an increase in heterogeneous architectures combining multi-core CPUs with accelerators such as GPUs, FPGAs, and the Intel Xeon Phi. GPUs can achieve significant performance for certain categories of application. Nevertheless, achieving this performance with low-level APIs (e.g. CUDA, OpenCL) requires rewriting the sequential code, having a good knowledge of GPU […]
Apr, 26
Accelerating Winograd Convolutions using Symbolic Computation and Meta-programming
Convolution operations are essential constituents of convolutional neural networks. Their efficient and performance-portable implementation demands tremendous programming effort and fine-tuning. Winograd’s minimal filtering algorithm is a well-known method to reduce the computational complexity of convolution operations. Unfortunately, existing implementations of this algorithm are either vendor-specific or hard-coded to support a small subset of convolutions, thus […]
Apr, 26
GEVO: GPU Code Optimization using Evolutionary Computation
GPUs are a key enabler of the revolution in machine learning and high performance computing, functioning as de facto co-processors to accelerate large-scale computation. As the programming stack and tool support have matured, GPUs have also become accessible to programmers, who may lack detailed knowledge of the underlying architecture and fail to fully leverage the […]
Apr, 26
Evaluating FPGA Accelerator Performance with a Parameterized OpenCL Adaptation of the HPCChallenge Benchmark Suite
FPGAs have found increasing adoption in data center applications since a new generation of high-level tools has become available, tools which noticeably reduce development time for FPGA accelerators and still provide high quality of results. There is, however, no high-level benchmark suite available which specifically enables a comparison of FPGA architectures, programming tools and libraries for […]
Apr, 26
Cpp-Taskflow: A General-purpose Parallel and Heterogeneous Task Programming System at Scale
The Cpp-Taskflow project addresses the long-standing question: How can we make it easier for developers to write parallel and heterogeneous programs with high performance and simultaneous high productivity? Cpp-Taskflow develops a simple and powerful task programming model to enable efficient implementations of heterogeneous decomposition strategies. Our programming model empowers users with both static and dynamic […]
Apr, 19
OpenCL-Darknet: implementation and optimization of OpenCL-based deep learning object detection framework
Object detection is a technology that deals with recognizing classes of objects and their location. It is used in many different areas, such as in face-detecting systems [16, 34, 37], surveillance tools [9], human-machine interfaces [17], and self-driving cars [18, 23, 25, 26, 30]. These days, deep learning object detection approaches have achieved significantly better […]