## Posts

Oct, 31

### Using an OpenCL Framework to Evaluate Interconnect Implementations on FPGAs

Field Programmable Gate Arrays (FPGAs) are an ideal platform for building systems with custom hardware accelerators; however, managing these systems is still a major challenge. The OpenCL standard has become accepted as a good programming model for managing heterogeneous platforms due to its rich constructs. Although commercial OpenCL frameworks are now emerging, there is a […]

Oct, 29

### Implementing Level-3 BLAS Routines in OpenCL on Different Processing Units

This paper presents an implementation of different matrix-matrix multiplication routines in OpenCL. We utilize the high-performance GEMM (GEneral Matrix-Matrix Multiply) implementation from our previous work for the present implementation of other matrix-matrix multiply routines in Level-3 BLAS (Basic Linear Algebra Subprograms). The other routines include SYMM (Symmetric Matrix-Matrix Multiply), SYRK (Symmetric Rank-K Update), SYR2K (Symmetric […]
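As a rough illustration of how a GEMM routine can serve other Level-3 routines, the sketch below computes the left-sided SYMM update `C = alpha*A*B + beta*C` by expanding a symmetric matrix stored in its lower triangle and passing it to a plain GEMM. This is a minimal pure-Python sketch, not the paper's OpenCL implementation; the function names `gemm` and `symm_lower` and the list-of-lists layout are assumptions made for the example.

```python
def gemm(alpha, a, b, beta, c):
    """Return alpha * (a @ b) + beta * c for matrices given as lists of rows."""
    n, k, m = len(a), len(b), len(b[0])
    return [
        [alpha * sum(a[i][p] * b[p][j] for p in range(k)) + beta * c[i][j]
         for j in range(m)]
        for i in range(n)
    ]


def symm_lower(alpha, a_lower, b, beta, c):
    """SYMM (left side) where only the lower triangle of A is read.

    Hypothetical helper: entries above the diagonal of a_lower are ignored
    and replaced by their mirrored lower-triangle values before calling gemm.
    """
    n = len(a_lower)
    full = [
        [a_lower[i][j] if j <= i else a_lower[j][i] for j in range(n)]
        for i in range(n)
    ]
    return gemm(alpha, full, b, beta, c)
```

Expanding the triangle costs extra memory; it is shown here only because it makes the reuse of GEMM explicit.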

Oct, 29

### Sparse Recovery on GPUs: Accelerating the Iterative Soft-Thresholding Algorithm

Solving linear inverse problems where the solution is known to be sparse is of interest to both signal processing and machine learning research. The standard algorithms for solving such problems are sequential in nature – they tend to be slow for large scale problems. In the past, researchers have used Graphics Processing Units to accelerate […]

Oct, 29

### Efficient Particle-Mesh Spreading on GPUs

The particle-mesh spreading operation maps a value at an arbitrary particle position to contributions at regular positions on a mesh. This operation is often used when a calculation involving irregular positions is to be performed in Fourier space. We study several approaches for particle mesh spreading on GPUs. A central concern is the use of […]
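The spreading operation described above can be sketched in one dimension. The example below is a minimal serial sketch, assuming a periodic mesh and linear interpolation weights between the two nearest mesh points; the kernel choice and the name `spread_linear_1d` are assumptions for illustration, not details taken from the paper.

```python
import math


def spread_linear_1d(positions, values, n_cells, spacing=1.0):
    """Spread each particle value onto its two neighboring mesh points.

    Assumes a periodic 1D mesh with n_cells points and linear weights.
    """
    mesh = [0.0] * n_cells
    for x, v in zip(positions, values):
        s = x / spacing              # position in units of mesh spacing
        left = math.floor(s)         # index of the mesh point to the left
        frac = s - left              # fractional distance past that point
        mesh[left % n_cells] += v * (1.0 - frac)
        mesh[(left + 1) % n_cells] += v * frac
    return mesh
```

With these weights, the sum over the mesh equals the sum of the particle values, which is a convenient sanity check.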

Oct, 29

### Performance Modeling, Optimization, and Characterization on Heterogeneous Architectures

Today, heterogeneous computing has truly reshaped the way scientists think about and approach high-performance computing (HPC). Hardware accelerators such as general-purpose graphics processing units (GPUs) and the Intel Many Integrated Core (MIC) architecture continue to make inroads in accelerating large-scale scientific applications. These advancements, however, introduce new sets of challenges to the scientific community such as: selection […]

Oct, 29

### Parallel training of Deep Neural Networks with Natural Gradient and Parameter Averaging

We describe the neural-network training framework used in the Kaldi speech recognition toolkit, which is geared towards training DNNs with large amounts of training data using multiple GPU-equipped or multi-core machines. In order to be as hardware-agnostic as possible, we needed a way to use multiple machines without generating excessive network traffic. Our method is […]

Oct, 27

### Testing and Exposing Weak Graphics Processing Unit Memory Models

Graphics Processing Units (GPUs) are highly parallel shared memory microprocessors, and as such, they are prone to the same concurrency considerations as their traditional multicore CPU counterparts. In this thesis, we consider shared memory consistency, i.e., which values a read can return when it is issued concurrently with writes on current GPU hardware. While memory consistency has been […]

Oct, 27

### Parallel Finite Volume Algorithm on Graphic Processing Units (GPU)

The capabilities of Graphic Processing Units (GPUs) as a computational tool in CFD are investigated here. Several solvers for linear matrix equations have been benchmarked on the GPU, and it is shown that Gauss-Seidel gives the best performance for the GPU architecture. Compared to CPU on a case of lid-driven cavity flow, speedups of up […]

Oct, 27

### Bayesian Neural Networks in Data-Intensive High Energy Physics Applications

This dissertation studies a graphical processing unit (GPU) construction of Bayesian neural networks (BNNs) using large training data sets. The goal is to create a program for the mapping of phenomenological Minimal Supersymmetric Standard Model (pMSSM) parameters to their predictions. This would allow for a more robust method of studying the Minimal Supersymmetric Standard Model, […]

Oct, 27

### Finding Longest Common Subsequences by GPU-Based Parallel Ant Colony Optimization

The longest common subsequence (LCS) problem is one of the classic problems in string processing. It is commonly used in file comparison, pattern recognition, and computational biology as a measure of sequence similarity. Given a set of strings, the LCS is the longest string that is a subsequence of every string in the set. For […]
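To make the definition concrete, the sketch below checks the subsequence property and computes an LCS for the two-string case with the classic dynamic-programming table. It only illustrates the problem statement; it is not the ant colony optimization method of the paper, and the function names are assumptions for the example.

```python
def is_subsequence(candidate, text):
    """Return True if candidate can be obtained by deleting characters from text."""
    remaining = iter(text)
    # Each membership test consumes the iterator, so order is respected.
    return all(ch in remaining for ch in candidate)


def lcs_two(a, b):
    """Return one longest common subsequence of exactly two strings."""
    rows, cols = len(a), len(b)
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    # Walk back through the table to recover the characters.
    out = []
    i, j = rows, cols
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))
```

For example, `lcs_two("AGGTAB", "GXTXAYB")` returns `"GTAB"`, which `is_subsequence` confirms against both inputs.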

Oct, 27

### Contract-Based General-Purpose GPU Programming

Using GPUs as general-purpose processors has revolutionized parallel computing by offering, for a large and growing set of algorithms, massive data-parallelization on desktop machines. As an obstacle to widespread adoption, programming GPUs has remained difficult because achieving good performance requires low-level control of the hardware. This paper suggests a programming […]

Oct, 25

### GPGPU Acceleration for Skeletal Animation: Comparing OpenCL with CUDA and GLSL

The existing matrix palette algorithms for skeletal animation are accelerated with GPGPU techniques based on GLSL or CUDA. Because GLSL is an extension of the graphics library OpenGL, it couples rendering and computation closely and is therefore inconvenient to reuse, while CUDA is designed only for NVIDIA GPUs. In this paper GPGPU based […]