Posts
Aug, 19
Kernel Tuner: A search-optimizing GPU code auto-tuner
A very common problem in GPU programming is that some combinations of thread block dimensions and other code optimization parameters, such as tiling or unrolling factors, result in dramatically better performance than other kernel configurations. Obtaining highly efficient kernels therefore often requires searching vast and discontinuous spaces that consist of all possible combinations […]
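The discontinuous search this excerpt describes can be sketched as a brute-force enumeration over the cross-product of tuning parameters. The parameter names and the stand-in cost function below are illustrative assumptions, not Kernel Tuner's actual API; a real tuner would compile and time the kernel for each configuration:

```python
import itertools

# Hypothetical tunable parameters for a GPU kernel (illustrative values).
tune_params = {
    "block_size_x": [32, 64, 128, 256],
    "tile_size": [1, 2, 4],
    "unroll": [0, 1],
}

def benchmark(cfg):
    # Stand-in for compiling and timing a kernel with this configuration.
    # The real objective is measured runtime, which is discontinuous in cfg.
    return (abs(cfg["block_size_x"] - 128) / 128
            + 1.0 / cfg["tile_size"]
            + (0.1 if cfg["unroll"] == 0 else 0.0))

# Exhaustive search over the full cross-product of parameter values.
names = list(tune_params)
best_cfg, best_time = None, float("inf")
for values in itertools.product(*(tune_params[n] for n in names)):
    cfg = dict(zip(names, values))
    t = benchmark(cfg)
    if t < best_time:
        best_cfg, best_time = cfg, t
```

Even this tiny space has 4 × 3 × 2 = 24 configurations; realistic spaces are far larger, which is why search-optimizing strategies matter.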
Aug, 5
GPU schedulers: how fair is fair enough?
Blocking synchronisation idioms, e.g. mutexes and barriers, play an important role in concurrent programming. However, systems with semi-fair schedulers, e.g. graphics processing units (GPUs), are becoming increasingly common. Such schedulers provide varying degrees of fairness, guaranteeing enough to allow some, but not all, blocking idioms. While a number of applications that use blocking idioms do […]
Jul, 1
Directive-Based, High-Level Programming and Optimizations for High-Performance Computing with FPGAs
Reconfigurable architectures like Field Programmable Gate Arrays (FPGAs) have been used for accelerating computations from several domains because of their unique combination of flexibility, performance, and power efficiency. However, FPGAs have not been widely used for high-performance computing, primarily because of their programming complexity and difficulties in optimizing performance. In this paper, we present a […]
Jul, 1
Compiler Fuzzing through Deep Learning
Random program generation – fuzzing – is an effective technique for discovering bugs in compilers but successful fuzzers require extensive development effort for every language supported by the compiler, and often leave parts of the language space untested. We introduce DeepSmith, a novel machine learning approach to accelerating compiler validation through the inference of generative […]
Jun, 24
Synthesis of GPU Programs from High-Level Models
Modern graphics processing units (GPUs) provide high-performance general-purpose computation capabilities. They have massively parallel architectures that are suitable for executing parallel algorithms and operations. They are also throughput-oriented devices that are optimized to achieve high throughput for stream processing. Designing efficient GPU programs is a notoriously difficult task. The ForSyDe methodology is suitable to […]
Jun, 24
Strategies for the Heterogeneous Execution of Large-Scale Simulations on Hybrid Supercomputers
Massively-parallel devices of various architectures are being adopted by the newest supercomputers to overcome the current power constraints in the context of the exascale challenge. This progress leads to an increasing hybridisation of HPC systems and makes the design of computing applications a rather complex problem. Therefore, software efficiency and portability are of crucial […]
Jun, 13
Aspect-Driven Mixed-Precision Tuning Targeting GPUs
Writing mixed-precision kernels makes it possible to achieve higher throughput while keeping output precision within given limits. The recent introduction of native half-precision arithmetic in several GPUs, such as the NVIDIA P100 and AMD Vega 10, makes precision tuning even more relevant. However, it is not trivial to manually find which […]
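The core precision-tuning question — does a reduced-precision variant stay within a given error bound? — can be sketched in pure Python using the stdlib's IEEE 754 half-precision pack format (`struct` format `'e'`). The dot-product kernel and the tolerance are illustrative assumptions, not the paper's tool:

```python
import struct

def to_half(x):
    # Round a Python float to IEEE 754 half precision via struct's 'e' format.
    return struct.unpack('e', struct.pack('e', x))[0]

def dot(xs, ys, cast=lambda v: v):
    # Simple dot product; `cast` simulates the precision of every operation.
    acc = cast(0.0)
    for x, y in zip(xs, ys):
        acc = cast(acc + cast(x) * cast(y))
    return acc

xs = [0.1 * i for i in range(1, 9)]
ys = [0.05 * i for i in range(1, 9)]

exact = dot(xs, ys)               # double-precision reference
half = dot(xs, ys, cast=to_half)  # simulated half-precision kernel
rel_err = abs(half - exact) / abs(exact)
```

A precision tuner would accept the half-precision variant only if `rel_err` stays below the user-specified limit, and otherwise keep that operation in a wider type.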
Jun, 5
The Third International Workshop on GPU Computing and AI (GCA), 2018
==================================================== The Third International Workshop on GPU Computing and AI (GCA) http://is-candar.org/GCA18/ to be held in conjunction with The Sixth International Symposium on Computing and Networking (CANDAR’18), Hida Takayama, Japan, November 27-30, 2018 http://is-candar.org/ ==================================================== [Introduction] Built for massive parallelism, General-Purpose computing on Graphics Processing Units (GPGPU) has superseded high-performance CPUs in several important […]
Jun, 2
clMF: A fine-grained and portable alternating least squares algorithm for parallel matrix factorization
Alternating least squares (ALS) has proven to be an effective solver for matrix factorization in recommender systems. To speed up factorization, various parallel ALS solvers have been proposed to leverage modern multi-core and many-core processors. Existing implementations are limited in either speed or portability. In this paper, we present an efficient and portable ALS […]
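The alternating structure of ALS can be illustrated with a minimal rank-1 factorization in plain Python: with one factor fixed, the least-squares update for the other has a closed form. This is a didactic sketch, not the paper's parallel clMF solver:

```python
def als_rank1(R, iters=20):
    # Factor R ≈ u·vᵀ by alternating closed-form least-squares updates:
    # with v fixed, the optimal u[i] minimizes sum_j (R[i][j] - u[i]*v[j])^2,
    # and symmetrically for v with u fixed.
    m, n = len(R), len(R[0])
    u = [1.0] * m
    v = [1.0] * n
    for _ in range(iters):
        denom_v = sum(vj * vj for vj in v)
        u = [sum(R[i][j] * v[j] for j in range(n)) / denom_v for i in range(m)]
        denom_u = sum(ui * ui for ui in u)
        v = [sum(R[i][j] * u[i] for i in range(m)) / denom_u for j in range(n)]
    return u, v

# An exactly rank-1 test matrix: R[i][j] = (i+1) * (j+1).
R = [[(i + 1) * (j + 1) for j in range(3)] for i in range(4)]
u, v = als_rank1(R)
err = max(abs(R[i][j] - u[i] * v[j]) for i in range(4) for j in range(3))
```

In a recommender setting R is sparse with higher rank, and the per-row updates become small independent linear systems, which is what makes ALS amenable to the fine-grained parallelization the excerpt describes.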
May, 26
Transformations of High-Level Synthesis Codes for High-Performance Computing
Specialized hardware architectures promise a major step in performance and energy efficiency over the traditional load/store devices currently employed in large scale computing systems. The adoption of high-level synthesis (HLS) from languages such as C/C++ and OpenCL has greatly increased programmer productivity when designing for such platforms. While this has enabled a wider audience to […]
May, 12
EngineCL: Usability and Performance in Heterogeneous Computing
Heterogeneous systems composed of a CPU and a set of hardware accelerators have become one of the most common architectures today, thanks to their excellent performance and low energy consumption. However, their heterogeneity makes them very complex to program, and achieving performance portability across different devices is even harder. This paper presents EngineCL, a […]