Graphics processing units (GPUs) are powerful graphics engines featuring high levels of parallelism and extreme memory bandwidth, which makes them a powerful computing platform for solving complex problems involving chemically reacting flows. In the present study, computer programs for combustion simulations with detailed chemical kinetic mechanisms were developed in the Compute Unified Device Architecture (CUDA) language for the NVIDIA GPU architecture. Ignition processes were simulated under constant pressure and constant volume conditions using an explicit 4th order Runge-Kutta algorithm for time integration. Sufficiently small time steps were identified with time scale analysis to ensure stable integration. The program was validated against results from CPU simulations using detailed mechanisms of various fuels, including H2 and CH4. It was found that the GPU-accelerated simulations can be approximately 10-20 times faster than those on CPUs for solving identical problems. Furthermore, the newly implemented GPU solver for detailed chemical kinetics was employed for quasi 2-D simulations.
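
As a rough illustration of the explicit 4th order Runge-Kutta integration and the time-step restriction mentioned in the abstract above, the following is a minimal CPU-side sketch in Python, not the authors' CUDA code. The two-species linear decay system, the names `RATES`, `decay`, `stable_step` and `integrate`, and the safety factor are all assumptions made for illustration. The constant 2.785 is the approximate stability limit of classical RK4 along the negative real axis.

```python
import math

# Approximate real-axis stability limit of classical RK4: |lambda * h| < ~2.785.
RK4_REAL_STABILITY_LIMIT = 2.785


def stable_step(fastest_rate, safety=0.5):
    """Step size kept below the RK4 stability limit for the fastest decay rate."""
    return safety * RK4_REAL_STABILITY_LIMIT / fastest_rate


def rk4_step(f, t, y, h):
    """Advance the state list y by one classical RK4 step of size h."""
    k1 = f(t, y)
    k2 = f(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k1)])
    k3 = f(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k2)])
    k4 = f(t + h, [yi + h * ki for yi, ki in zip(y, k3)])
    return [yi + h / 6 * (a + 2 * b + 2 * c + d)
            for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]


def integrate(f, y0, t_end, h):
    """Take round(t_end / h) RK4 steps of size h starting from t = 0."""
    y = list(y0)
    for i in range(round(t_end / h)):
        y = rk4_step(f, i * h, y, h)
    return y


# Hypothetical stiff-ish test system: one slow and one fast decaying species.
RATES = [2.0, 50.0]


def decay(t, y):
    """Right-hand side dy_i/dt = -k_i * y_i."""
    return [-k * yi for k, yi in zip(RATES, y)]


if __name__ == "__main__":
    h = stable_step(max(RATES))
    print("step size:", h)
    print("state near t = 1:", integrate(decay, [1.0, 1.0], 1.0, h))
```

In this toy system the fastest rate plays the role of the shortest chemical time scale: a step above roughly 2.785 divided by that rate makes the fast component grow without bound, while smaller steps decay as expected.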

A novel algorithm is presented to compute the convex hull of a point set in R3 using the graphics processing unit (GPU). By exploiting the relationship between the Voronoi diagram and the convex hull, the algorithm derives an approximation of the convex hull from the former. The missed points are recovered by a two-round check, first in digital space and then in continuous space. The algorithm does not need explicit locking or any other concurrency control mechanism, so it can maximize the parallelism available on the modern GPU. The implementation using the CUDA programming model on Nvidia GPUs is robust, exact, and efficient. The experiments show that it is up to an order of magnitude faster than other sequential convex hull implementations running on the CPU for inputs of millions of points. This work demonstrates that the GPU can be used to solve non-trivial computational geometry problems with significant performance benefit, without sacrificing accuracy or robustness.

Radiation-hardened processors are designed to be resilient against soft errors, but such processors are slower than Commercial Off-The-Shelf (COTS) processors as well as significantly costlier. In order to mitigate the high costs, software techniques such as task re-executions must be deployed together with adequately hardened processors to provide reliability. This leads to a huge design space comprising the hardening level of the processors and the number of re-executions of each task in the system. Each configuration in this design space represents a tradeoff between processor load, reliability and costs. The reliability comes at the price of higher costs due to higher levels of hardening and performance degradation due to hardening or due to re-executions. Thus, the tradeoffs between performance, reliability and costs must be carefully studied. Pertinent questions that arise in such a design scenario are: (i) how many times must a task be re-executed and (ii) what should the hardening level be, such that the system reliability is satisfied? In order to evaluate such tradeoffs efficiently, in this thesis, we propose a novel framework that harnesses the computational power of Graphics Processing Units (GPUs). Our framework is based on a system failure probability analysis that connects the probability of failure of tasks to the overall system reliability. Based on characteristics of this probabilistic analysis as well as real-time deadlines, we derive bounds on the design space to prune infeasible solutions. Finally, we illustrate the benefits of our proposed framework with several experiments.
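
To make the link between per-task failure probability, re-executions and system reliability more concrete, here is a deliberately simplified Python sketch. It is not the analysis from the thesis: it assumes that every execution of a task fails independently with a fixed probability, that a task fails only if all of its attempts fail, and that the system fails if any task fails. All function names are hypothetical.

```python
import math


def task_failure_probability(p_exec, reexecutions):
    """Probability that the first run and all re-executions fail (independence assumed)."""
    return p_exec ** (reexecutions + 1)


def system_failure_probability(tasks):
    """tasks is a list of (p_exec, reexecutions); the system fails if any task fails."""
    p_all_ok = 1.0
    for p_exec, reexecutions in tasks:
        p_all_ok *= 1.0 - task_failure_probability(p_exec, reexecutions)
    return 1.0 - p_all_ok


def min_reexecutions(p_exec, target):
    """Smallest number of re-executions that brings a task's failure probability to target or below."""
    if not 0.0 <= p_exec < 1.0 or target <= 0.0:
        raise ValueError("need 0 <= p_exec < 1 and target > 0")
    k = 0
    while task_failure_probability(p_exec, k) > target:
        k += 1
    return k


if __name__ == "__main__":
    # Hypothetical numbers: a less hardened processor (p = 0.1) vs. a more hardened one (p = 0.01).
    for p in (0.1, 0.01):
        print(p, "->", min_reexecutions(p, 5e-4), "re-executions per task")
```

Under these assumptions, a lower hardening level (larger per-execution failure probability) must be compensated by more re-executions, which in turn increases processor load; this is the kind of tradeoff the abstract describes.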

Modern computer processing units tend towards simpler cores in greater numbers, favouring the development of data-parallel applications. Evolutionary algorithms are ideal for taking full advantage of SIMD (Single Instruction, Multiple Data) processing, which is available on both CPUs and GPUs. Creating software that runs on a GPU requires the use of specialised programming languages or styles, forcing practitioners to acquire new skills and limiting the portability of their developments. In this paper, we present an automatic translation from ESDL, a domain-specific language for composing evolutionary algorithms from arbitrary operators, to C++ AMP, a C++ extension for targeting heterogeneous hardware. Generating executable code from a simple platform-independent description allows practitioners with varying levels of programming expertise to take advantage of data-parallel execution, and enables those with strong expertise to further optimise their implementations. The automatic transformation is shown to produce code less optimal than a manual implementation but with significantly less developer effort. A secondary result is that GPU implementations require a large population, large individuals or an expensive evaluation function to achieve performance benefits over the CPU. All code developed for this paper is freely available online from http://stevedower.id.au/esdl/amp.

General purpose computing on graphics processing units (GPGPU) consists of using GPUs to handle computations commonly handled by CPUs. GPGPU programming implies developing specific programs to run on GPUs, managed by a host program running on the CPU. Achieving high performance requires explicitly organizing memory transfers between devices. Besides, several incompatible frameworks exist, making productivity and portability difficult to achieve. In this paper, we describe SPOC, an OCaml library that defines specific data sets in order to automatically manage transfers between GPU and CPU. SPOC also offers a runtime library that looks for multiple frameworks and makes them usable transparently. We also describe the link between SPOC and the OCaml garbage collector to optimize transfers dynamically. SPOC benchmarks show that SPOC can offer great performance while simplifying GPGPU programming.

The aim of the present paper is to report on our recent results for GPU-accelerated simulations of compressible flows. For the numerical simulation, the adaptive discontinuous Galerkin method with the multidimensional bicharacteristic-based evolution Galerkin operator has been used. For time discretization we have applied the explicit third order Runge-Kutta method. Evaluation of the genuinely multidimensional evolution operator has been accelerated using the GPU implementation. We have obtained a speedup of up to 30 (in comparison to a single CPU core) for the calculation of the evolution Galerkin operator on a typical discretization mesh consisting of 16384 mesh cells.

The Heisenberg model of classical spins makes use of both Monte Carlo stochastic dynamics as well as time-integration of its equation of motion. These two schemes have different parallelisation strategies and tradeoffs. We implement both algorithms using a data-parallel approach for Graphical Processing Units (GPUs) and we discuss the resulting performance on various combinations of single and multiple GPUs. In addition to studying Monte Carlo dynamical update schemes, we use our fast simulation code to explore the scaling and time correlations of a large-scale Heisenberg model system using a high-order numerical integration algorithm, which enables accurate study of spin wave phenomena and time-correlation functions. We also discuss various graphical rendering models to appropriately visualise the spin vectors inside an interactive Heisenberg spin simulation.

LSQR (Sparse Equations and Least Squares) is a widely used Krylov subspace method to solve large-scale linear systems in seismic tomography. This paper presents a parallel MPI-CUDA implementation of the LSQR solver. At the CUDA level, our contributions include: (1) utilizing CUBLAS and CUSPARSE to compute the major steps in LSQR; (2) optimizing memory copies between host memory and device memory; (3) developing a CUDA kernel to perform transpose SpMV without transposing the matrix in memory or keeping an additional copy. At the MPI level, our contributions include: (1) decomposing both matrix and vector to increase parallelism; (2) designing a static load balancing strategy. In our experiment, the single GPU code achieves up to 17.6x speedup with 15.7 GFlops in single precision and 15.2x speedup with 12.0 GFlops in double precision compared with the original serial CPU code. The MPI-GPU code achieves up to 3.7x speedup with 268 GFlops in single precision and 3.8x speedup with 223 GFlops in double precision on 135 MPI tasks compared with the corresponding MPI-CPU code. The MPI-GPU code scales on both strong and weak scaling tests. In addition, our parallel implementations have better performance than the LSQR subroutine in the PETSc library.
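
The idea of a transpose SpMV that neither transposes the matrix nor keeps a second copy can be sketched on the CPU as follows. This is a sequential Python illustration that assumes the matrix is stored in CSR form; the abstract does not state the storage format or the design of the actual CUDA kernel, and the function names are hypothetical. The transposed product walks the same CSR arrays as the regular product but scatters each contribution into the output entry of its column.

```python
def csr_spmv(indptr, indices, data, x):
    """y = A @ x for a CSR matrix: each row gathers from x."""
    n_rows = len(indptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        for p in range(indptr[i], indptr[i + 1]):
            y[i] += data[p] * x[indices[p]]
    return y


def csr_spmv_transposed(n_cols, indptr, indices, data, x):
    """y = A^T @ x using the CSR arrays of A directly: each row scatters into y."""
    y = [0.0] * n_cols
    for i in range(len(indptr) - 1):
        for p in range(indptr[i], indptr[i + 1]):
            y[indices[p]] += data[p] * x[i]
    return y


if __name__ == "__main__":
    # A = [[1, 0, 2],
    #      [0, 3, 0]]  stored in CSR form.
    indptr, indices, data = [0, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0]
    print(csr_spmv(indptr, indices, data, [1.0, 1.0, 1.0]))
    print(csr_spmv_transposed(3, indptr, indices, data, [1.0, 2.0]))
```

In a parallel setting the scatter in `csr_spmv_transposed` means several rows may update the same output entry, so a GPU version would need atomic additions or some other conflict-avoidance scheme; which approach the paper takes is not stated in the abstract.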

We describe the problem of parallelizing finite difference method (FDM) and finite element method (FEM) computations for a certain class of partial differential equations (PDEs) on a High Performance Computing (HPC) GPU cluster. For FDM, structured grids have been employed and optimal data rearrangement operations are performed in GPU computations. For FEM, unstructured triangular and hexahedral meshes are generated and the graph partitioning software METIS [14] is used to generate load-balanced sub-domains. Iterative methods have been used to solve the resulting algebraic system of linear equations. A combination of MPI with CUDA- and OpenCL-enabled NVIDIA GPUs, as well as OpenCL-based AMD-ATI GPUs, of the HPC GPU cluster has been used in our experiments [4,6,7,8]. Our experiments indicate that the MPI-CUDA codes based on FDM and FEM achieve nearly 6x speed-ups for large mesh sizes in comparison to the host-CPU implementation of the same code. The un-optimized OpenCL implementation has shown only marginal improvement in GPU times, whereas the counterpart CUDA codes achieved a maximum speedup of 4x to 6x on the HPC GPU cluster. We present a performance analysis for different mesh sizes that demonstrates the performance and scalability of FDM and FEM computations on the GPU cluster.

This paper presents an integrated analytical and profile-based CUDA performance modeling approach to accurately predict the kernel execution times of sparse matrix-vector multiplication for CSR, ELL, COO, and HYB SpMV CUDA kernels. Based on our experiments conducted on a collection of 8 widely-used testing matrices on NVIDIA Tesla C2050, the execution times predicted by our model match the measured execution times of NVIDIA’s SpMV implementations very well. Specifically, for 29 out of 32 test cases, the performance differences are under or around 7%. For the remaining 3 test cases, the differences are between 8% and 10%. For CSR, ELL, COO, and HYB SpMV kernels, the differences are 4.2%, 5.2%, 1.0%, and 5.7% on average, respectively.
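
For readers unfamiliar with the storage formats named above, the following Python sketch converts a small CSR matrix to ELL and multiplies it by a vector. It only illustrates the data layout and is unrelated to the performance model or to NVIDIA's kernels; the function names and the choice of padding (value 0.0 with column index 0) are assumptions. ELL pads every row to the length of the longest row, COO stores explicit (row, column, value) triplets, and HYB keeps the regular part of the matrix in ELL and the overflow entries in COO.

```python
def csr_to_ell(indptr, indices, data):
    """Convert CSR arrays to ELL: fixed-width per-row column and value arrays."""
    n_rows = len(indptr) - 1
    width = max((indptr[i + 1] - indptr[i] for i in range(n_rows)), default=0)
    # Padding slots hold value 0.0 and column 0, so they contribute nothing to the product.
    cols = [[0] * width for _ in range(n_rows)]
    vals = [[0.0] * width for _ in range(n_rows)]
    for i in range(n_rows):
        for slot, p in enumerate(range(indptr[i], indptr[i + 1])):
            cols[i][slot] = indices[p]
            vals[i][slot] = data[p]
    return cols, vals


def ell_spmv(cols, vals, x):
    """y = A @ x for an ELL matrix: every row does the same number of multiply-adds."""
    return [sum(v * x[c] for c, v in zip(row_cols, row_vals))
            for row_cols, row_vals in zip(cols, vals)]


if __name__ == "__main__":
    # A = [[1, 0, 2],
    #      [0, 3, 0]]  stored in CSR form.
    cols, vals = csr_to_ell([0, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0])
    print(cols, vals)
    print(ell_spmv(cols, vals, [1.0, 1.0, 1.0]))
```

The uniform row width is what makes ELL regular, at the cost of wasted work on padding when row lengths vary widely.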

Implementation of a direct solver for symmetric positive definite sparse matrices of general structure, exploiting the parallelism of the graphics card (GPU). Implementation of a direct solver using the Schur complement, specifically for the requirements of the sparse systems arising in bundle adjustment.

In this paper, we present a pyramidal image blending algorithm based on the Compute Unified Device Architecture (CUDA) and built using object-oriented design patterns. This algorithm is an essential part of an image stitching process for a seamless panoramic mosaic. The CUDA framework is a novel GPU programming framework from NVIDIA. We introduce an object-oriented framework for CUDA-based image processing. We illustrate a set of design patterns exploiting the programming advantages of an object-oriented language, such as encapsulation, code reusability, information hiding, complexity hiding and extensibility. We discuss the framework’s performance in terms of programming effort, execution overhead and the speedup factor achieved over the CPU implementation. We also discuss the programming effort required for adding OpenGL Shading Language functionality to the framework.
