Posts
Sep, 22
SpMV: A Memory-Bound Application on the GPU Stuck Between a Rock and a Hard Place
In this paper, we investigate the relative merits of GPGPUs and multicores in the context of sparse matrix-vector multiplication (SpMV). While GPGPUs possess impressive capabilities in terms of raw compute throughput and memory bandwidth, their performance varies significantly with application tuning as well as sparse input and format characteristics. Furthermore, several emerging technological and workload […]
Sep, 22
Overlapping computation and communication of three-dimensional FDTD on a GPU cluster
Large-scale electromagnetic field simulations using the FDTD (finite-difference time-domain) method require the use of GPU (graphics processing unit) clusters. However, the communication overhead caused by slow interconnections becomes a major performance bottleneck. In this paper, as a way to remove the bottleneck, we propose the "kernel-split method" and the "host-buffer method" which overlap computation and […]
Sep, 22
Exploration of Parallelization Frameworks for Computational Finance
This paper presents a comparison of parallelization frameworks for efficient execution of computational finance workloads. We use a Value-at-Risk (VaR) workload to evaluate OpenCL and OpenMP parallelization frameworks on multi-core CPUs as opposed to GPUs. In addition, we study the impact of SMT on performance using GCC (4.4) and IBM XLC (11.01) compilers for both […]
Sep, 22
Modification of self-organizing migration algorithm for OpenCL framework
This paper deals with a modification of the self-organizing migration algorithm using the OpenCL framework. This modification allows the algorithm to exploit modern parallel devices, such as central processing units and graphics processing units. The main aim was to create an algorithm that shows significant speedup compared to the sequential variant. The second aim was to create the algorithm robust […]
Sep, 21
Large-Scale Motion Modelling using a Graphical Processing Unit
The increased availability of Graphical Processing Units (GPUs) in personal computers has made parallel programming worthwhile and more accessible, but not necessarily easier. This thesis will take advantage of the power of a GPU, in conjunction with the Central Processing Unit (CPU), in order to simulate target trajectories for large-scale scenarios, such as wide-area maritime […]
Sep, 21
Some examples of instant computations of fluid dynamics on GPU
This paper summarizes our experience with GPU and GPGPU computing for two-dimensional computational fluid dynamics using fine grids and three-dimensional kinetic transport problems. The choice of the computational approach is clearly critical for both performance speedup and efficiency. In our numerical experiments, we used a Lattice Boltzmann approach (LBM) for the […]
Sep, 21
Parallelization of Hierarchical Text Clustering on Multi-core CUDA Architecture
Text clustering is the problem of dividing text documents into groups, such that documents in the same group are similar to one another and different from documents in other groups. Because texts tend to form hierarchies, text clustering is best performed using a hierarchical clustering method. An important aspect while clustering large […]
Sep, 21
Fast and Efficient Automatic Memory Management for GPUs using Compiler-Assisted Runtime Coherence Scheme
Exploiting the performance potential of GPUs requires efficiently managing the data transfers to and from them, which is an error-prone and tedious task. In this paper, we develop a software coherence mechanism to fully automate all data transfers between the CPU and GPU without any assistance from the programmer. Our mechanism uses compiler analysis to […]
Sep, 21
Autotuning Wavefront Abstractions for Heterogeneous Architectures
We present our autotuned heterogeneous parallel programming abstraction for the wavefront pattern. An exhaustive search of the tuning space indicates that correctly set tuning factors can yield an average 37x speedup over a sequential baseline. Our best automated machine-learning-based heuristic obtains 92% of this ideal speedup, averaged across our full range of wavefront examples.
Sep, 20
Charged particles constrained to a curved surface
We study the motion of charged particles constrained to arbitrary two-dimensional curved surfaces but interacting in three-dimensional space via the Coulomb potential. To speed up the interaction calculations, we use the parallel compute capability of the Compute Unified Device Architecture (CUDA) of today's graphics boards. The particles and the curved surfaces are shown using the Open […]
Sep, 20
Evolutionary Clustering on CUDA
Unsupervised clustering of large data sets is a complicated task. Due to its complexity, various meta-heuristic machine learning algorithms have been used to automate the clustering process. Genetic and evolutionary algorithms have been deployed successfully to find clusters in data sets. GPU computing is a recent programming paradigm introducing high-performance parallel computing […]
Sep, 20
Binaural Simulations Using Audio Rate FDTD Schemes and CUDA
Three-dimensional finite-difference time-domain schemes can be used as an approach to spatial audio simulation. By embedding a model of the human head in a 3D computational space, such simulations can emulate binaural sound localisation. This approach normally relies on high sample rates to give finely detailed models, and is computationally intensive. […]