Posts
May, 4
2nd IEEE Int. Conf. on Control Science and Systems Engineering (CCSSE), 2016
Call for papers download: PDF The 2nd International Conference on Control Science and Systems Engineering (CCSSE 2016) will be held in Singapore during July 27-29. Control Science and Systems Engineering are entering a second golden phase. These new challenges bring with them new research questions, resulting in the need to stimulate the rapid awareness of […]
May, 3
TrimZero: A Torch Recurrent Module for Efficient Natural Language Processing
Deep learning frameworks supported by the CUDA parallel computing platform have boosted advances in machine learning research. The advantage of parallel processing largely comes from the efficiency of matrix-matrix multiplication on CUDA-enabled graphics processing units (GPUs). Therefore, for recurrent neural networks (RNNs), the use of a zero-filled matrix representing the variable lengths of sentences for a […]
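The zero-filled matrix representation mentioned in the abstract can be sketched in a few lines of NumPy. This is an illustrative reconstruction of the padding scheme, not the paper's Torch code; the token sequences are hypothetical.

```python
import numpy as np

# Hypothetical token-id sequences of different lengths.
sentences = [[4, 9, 2], [7, 1], [3, 8, 6, 5]]

# Pad every sentence with zeros up to the longest one, giving a dense
# batch matrix suitable for batched matrix-matrix multiplication.
max_len = max(len(s) for s in sentences)
batch = np.zeros((len(sentences), max_len), dtype=np.int64)
for i, s in enumerate(sentences):
    batch[i, :len(s)] = s

# Mask marking real tokens; the zero-padded positions are the wasted
# work that techniques like TrimZero aim to skip.
mask = (batch != 0).astype(np.float32)
```

The trade-off the paper targets is visible here: the padded entries make the batch rectangular (and GPU-friendly) but cost dead computation proportional to the length variance of the batch.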
May, 3
Automatic Test Case Reduction for OpenCL
We report on an extension to the C-Reduce tool, for automatic reduction of C test cases, to handle OpenCL kernels. This enables an automated method for detecting bugs in OpenCL compilers, by generating large random kernels using the CLsmith generator, identifying kernels that yield result differences across OpenCL platforms and optimisation levels, and using our […]
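The differential-testing idea behind this pipeline can be sketched as follows. This is a hedged illustration of the general method, not the C-Reduce/CLsmith implementation; the harness binary and its flags are hypothetical placeholders.

```python
import subprocess

def run_kernel(binary, platform, opt_level):
    """Run a compiled OpenCL test harness and capture its printed result.
    The binary name and command-line flags here are invented for
    illustration only."""
    out = subprocess.run(
        [binary, "--platform", platform, "--opt", str(opt_level)],
        capture_output=True, text=True)
    return out.stdout.strip()

def is_interesting(results):
    """A randomly generated kernel becomes a reduction candidate when
    any two platform/optimisation configurations disagree on its
    output (a likely miscompilation)."""
    return len(set(results)) > 1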
May, 3
Fast hyperbolic Radon transform represented as convolutions in log-polar coordinates
The hyperbolic Radon transform is a commonly used tool in seismic processing, for instance in seismic velocity analysis, data interpolation, and multiple removal. A direct implementation by summation of traces with different moveouts is computationally expensive for large data sets. In this paper we present a new method for fast computation of the hyperbolic […]
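The expensive direct summation that the paper's log-polar method replaces can be sketched directly. This is a minimal reconstruction of the classical brute-force transform, assuming the standard hyperbolic moveout t(x) = sqrt(tau^2 + (x/v)^2); it is not the paper's fast algorithm.

```python
import numpy as np

def hyperbolic_radon(data, t, x, tau, v):
    """Direct hyperbolic Radon transform: for each intercept time tau
    and velocity v, sum the data along the hyperbola
    t(x) = sqrt(tau^2 + (x/v)^2).  Cost is O(n_tau * n_v * n_x),
    which is what makes large data sets expensive."""
    dt = t[1] - t[0]
    m = np.zeros((len(tau), len(v)))
    for i, t0 in enumerate(tau):
        for j, vel in enumerate(v):
            tt = np.sqrt(t0**2 + (x / vel) ** 2)   # travel times per offset
            idx = np.round((tt - t[0]) / dt).astype(int)
            valid = idx < len(t)                   # drop samples past the record
            m[i, j] = data[idx[valid], np.arange(len(x))[valid]].sum()
    return m
```

A synthetic hyperbolic event focuses to a single peak in (tau, v) space, which is exactly the property velocity analysis and multiple removal exploit.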
May, 3
Polly-ACC: Transparent compilation to heterogeneous hardware
Programming today’s increasingly complex heterogeneous hardware is difficult, as it commonly requires the use of data-parallel languages, pragma annotations, specialized libraries, or DSL compilers. Adding explicit accelerator support into a larger code base is not only costly, but also introduces additional complexity that hinders long-term maintenance. We propose a new heterogeneous compiler that brings us […]
May, 3
Exposing Errors Related to Weak Memory in GPU Applications
We present the systematic design of a testing environment that uses stressing and fuzzing to reveal errors in GPU applications that arise due to weak memory effects. We evaluate our approach on seven GPUs spanning three Nvidia architectures, across ten CUDA applications that use fine-grained concurrency. Our results show that applications that rarely or never […]
Apr, 29
Array Program Transformation with Loo.py by Example: High-Order Finite Elements
To concisely and effectively demonstrate the capabilities of our program transformation system Loo.py, we examine a transformation path from two real-world Fortran subroutines as found in a weather model to a single high-performance computational kernel suitable for execution on modern GPU hardware. Along the transformation path, we encounter kernel fusion, vectorization, prefetching, parallelization, and algorithmic […]
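Of the transformations listed, kernel fusion is the easiest to illustrate in isolation. The sketch below shows the idea in plain NumPy/Python rather than through Loo.py's actual transformation API, and the two "kernels" are invented stand-ins for the weather-model subroutines.

```python
import numpy as np

def unfused(a, b):
    """Two separate passes over memory, as two standalone
    subroutines would compute."""
    tmp = 2.0 * a          # kernel 1: writes an intermediate array
    return tmp + b         # kernel 2: reads it back

def fused(a, b):
    """Kernel fusion: one pass, no intermediate array traffic --
    the property that makes fusion attractive on bandwidth-bound
    GPU hardware."""
    out = np.empty_like(a)
    for i in range(len(a)):
        out[i] = 2.0 * a[i] + b[i]
    return out
```

Both produce the same result; the fused version trades the intermediate store/load for a single loop body, which is what a transformation system like Loo.py automates (together with vectorization, prefetching, and parallelization of that loop).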
Apr, 29
On the design of sparse hybrid linear solvers for modern parallel architectures
In the context of this thesis, our focus is on numerical linear algebra, more precisely on the solution of large sparse systems of linear equations. We focus on designing efficient parallel implementations of MaPHyS, a hybrid linear solver based on domain decomposition techniques. First, we investigate the MPI+threads approach. In MaPHyS, the first level of parallelism […]
Apr, 29
Automatic Parallelization: Executing Sequential Programs on a Task-Based Parallel Runtime
There are billions of lines of sequential code in today's software that do not benefit from the parallelism available in modern multicore architectures. Automatically parallelizing sequential code, to promote efficient use of the available parallelism, has been a research goal for some time now. This work proposes a new approach for achieving that goal. […]
Apr, 29
Adaptive GPU Array Layout Auto-Tuning
Optimal performance is an important goal in compute-intensive applications. For GPU applications, achieving it requires a lot of experience and knowledge about the algorithms and the underlying hardware, making them an ideal target for auto-tuning approaches. We present an auto-tuner which optimizes array layouts in CUDA applications. Depending on the data and program parameters, kernels […]
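The layout choice such an auto-tuner searches over is typically array-of-structs versus struct-of-arrays. The NumPy sketch below shows the two layouts side by side; it illustrates the distinction only, not the paper's CUDA tuning mechanism, and the field names are hypothetical.

```python
import numpy as np

n = 1000

# Array-of-structs (AoS): fields interleaved per element, so one
# element's x, y, z sit contiguously (12 bytes per record here).
aos = np.zeros(n, dtype=[("x", np.float32),
                         ("y", np.float32),
                         ("z", np.float32)])

# Struct-of-arrays (SoA): one contiguous array per field -- the layout
# that usually gives coalesced accesses when GPU threads all touch
# the same field.
soa = {f: np.zeros(n, dtype=np.float32) for f in ("x", "y", "z")}

aos["x"][:] = 1.0
soa["x"][:] = 1.0
```

Which layout wins depends on the access pattern: kernels that read one field across many elements favour SoA, while kernels that read all fields of one element favour AoS, which is why the choice is data- and program-dependent and worth auto-tuning.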
Apr, 29
Parallel Subgraph Mining on Hybrid Platforms: HPC Systems, Multi-Cores and GPUs
Frequent subgraph mining (FSM) is an important problem in numerous application areas, such as computational chemistry, bioinformatics, social networks, and programming languages. However, the problem is computationally hard because it requires enumerating a possibly exponential number of candidate subgraph patterns, and checking their presence in a single large graph or a database of graphs. […]
Apr, 29
A Survey of Cache Bypassing Techniques
With increasing core counts, the cache demand of modern processors has also increased. However, due to strict area/power budgets and the presence of workloads with poor data locality, blindly scaling cache capacity is both infeasible and ineffective. Cache bypassing is a promising technique to increase effective cache capacity without incurring the power/area costs of a larger cache. However, injudicious […]