Posts
Aug, 1
A Scalable Approach to Solving Dense Linear Algebra Problems on Hybrid CPU-GPU Systems
Aiming to fully exploit the computing power of all CPUs and all GPUs on hybrid CPU-GPU systems to solve dense linear algebra problems, we design a class of heterogeneous tile algorithms to maximize the degree of parallelism, to minimize the communication volume, as well as to accommodate the heterogeneity between CPUs and GPUs. The new […]
Aug, 1
Optimizing performance per watt on GPUs in High Performance Computing: temperature, frequency and voltage effects
The magnitude of the real-time digital signal processing challenge attached to large radio astronomical antenna arrays motivates use of high performance computing (HPC) systems. The need for high power efficiency (performance per watt) at remote observatory sites parallels that in HPC broadly, where efficiency is an emerging critical metric. We investigate how the performance per […]
Aug, 1
Discriminative Convolutional Sum-Product Networks on GPU
Sum-Product Networks (SPNs) are a deep architecture recently proposed for image classification and modeling. In contrast to loopy graphical models commonly used in computer vision, exact inference and learning in SPNs is tractable. As long as consistency and completeness are ensured, an SPN allows to efficiently calculate the partition function and all marginals of graphical […]
Jul, 30
Automatic Parallelization of Tiled Stencil Loop Nests on GPUs
This thesis attempts to design and implement a compiler framework based on the polyhedral model. The compiler automatically parallelizes loop nests; especially stencil kernels, into efficient GPU code by loop tiling transformations which the polyhedral model describes. To enhance parallel performance, we introduce three practically efficient techniques to process different types of loop nests. The […]
Jul, 30
Dynamic Data Management Among Multiple Databases for Optimization of Parallel Computations in Heterogeneous HPC Systems
Rapid development of diverse computer architectures and hardware accelerators caused that designing parallel systems faces new problems resulting from their heterogeneity. Our implementation of a parallel system called KernelHive allows to efficiently run applications in a heterogeneous environment consisting of multiple collections of nodes with different types of computing devices. The execution engine of the […]
Jul, 30
Scaling Multifluid Compressible Fluid Dynamics to 700,000 cores, 1.5 Pflop/s, and a Trillion Grid Cells
We are using the Blue Waters system at NCSA to study compressible, turbulent mixing of gases in the deep interiors of stars and also in the context of inertial confinement fusion (ICF). In December, 2012, during the Blue Waters friendly user access period, we carried out a simulation of an ICF test problem on a […]
Jul, 30
Research on Parallel DVH Statistic Based on CUDA
Dose Volume Histogram(DVH) is necessary for evaluating radiotherapy planning. With the increase of patient CT slices and the development of intensity-modulated radiation therapy(IMRT) technology, statistical process of DVH requires a large number of cubic interpolation calculation, and the sequential single threaded DVH code on the CPU can not meet the real-time requirement. The paper presents […]
Jul, 30
A CUDA-Based Real Parameter Optimization Benchmark
Benchmarking is key for developing and comparing optimization algorithms. In this paper, a CUDA-based real parameter optimization benchmark (cuROB) is introduced. Test functions of diverse properties are included within cuROB and implemented efficiently with CUDA. Speedup of one order of magnitude can be achieved in comparison with CPU-based benchmark of CEC’14.
Jul, 29
Optimizing Lempel-Ziv Factorization for the GPU Architecture
Lossless data compression is used to reduce storage requirements, allowing for the relief of I/O channels and better utilization of bandwidth. The Lempel-Ziv lossless compression algorithms form the basis for many of the most commonly used compression schemes. General purpose computing on graphic processing units (GPGPUs) allows us to take advantage of the massively parallel […]
Jul, 29
Implicit Methods for Real-Time simulation of Interactive Waves
The project focuses on developing a simulator in which ships and waves interact. The new wave model is the Variational Boussinesq model (VBM). However, this new realistic model brings much more computation effort with it. The VBM mainly requires an unsteady state solver, that solves a coupled system of equations at each frame (20 fps). […]
Jul, 29
Parallel Worldline Numerics: Implementation and Error Analysis
We give an overview of the worldline numerics technique, and discuss the parallel CUDA implementation of a worldline numerics algorithm. In the worldline numerics technique, we wish to generate an ensemble of representative closed-loop particle trajectories, and use these to compute an approximate average value for Wilson loops. We show how this can be done […]
Jul, 29
Mixed-precision orthogonalization scheme and its case studies with CA-GMRES on a GPU
We propose a mixed-precision orthogonalization scheme that takes the input matrix in a standard 32 or 64-bit floating-point precision, but uses higher-precision arithmetics to accumulate its intermediate results. For the 64-bit precision, our scheme uses software emulation for the higher-precision arithmetics, and requires about 20x more computation but about the same amount of communication as […]

