## Posts

Nov, 5

### Designing efficient sorting algorithms for manycore GPUs

We describe the design of high-performance parallel radix sort and merge sort routines for manycore GPUs, taking advantage of the full programmability offered by CUDA. Our radix sort is the fastest GPU sort reported in the literature, and is up to 4 times faster than the graphics-based GPUSort. It is also highly competitive with CPU […]

Nov, 5

### Best-effort semantic document search on GPUs

Semantic indexing is a popular technique used to access and organize large amounts of unstructured text data. We describe an optimized implementation of semantic indexing and document search on manycore GPU platforms. We observed that a parallel implementation of semantic indexing on a 128-core Tesla C870 GPU is only 2.4X faster than a sequential implementation […]

Nov, 5

### Sparse matrix solvers on the GPU: conjugate gradients and multigrid

Many computer graphics applications require high-intensity numerical simulation. We show that such computations can be performed efficiently on the GPU, which we regard as a full-function streaming processor with high floating-point performance. We implemented two basic, broadly useful computational kernels: a sparse matrix conjugate gradient solver and a regular-grid multigrid solver. Real-time applications […]

Nov, 5

### A Survey of General-Purpose Computation on Graphics Hardware

The rapid increase in the performance of graphics hardware, coupled with recent improvements in its programmability, has made graphics hardware a compelling platform for computationally demanding tasks in a wide variety of application domains. In this report, we describe, summarize, and analyze the latest research in mapping general-purpose computation to graphics hardware. We begin with […]

Nov, 5

### Accelerator: using data parallelism to program GPUs for general-purpose uses

GPUs are difficult to program for general-purpose uses. Programmers can either learn graphics APIs and convert their applications to use graphics pipeline operations or they can use stream programming abstractions of GPUs. We describe Accelerator, a system that uses data parallelism to program GPUs for general-purpose uses instead. Programmers use a conventional imperative programming language […]

Nov, 5

### NVIDIA Tesla: A Unified Graphics and Computing Architecture

To enable flexible, programmable graphics and high-performance computing, NVIDIA has developed the Tesla scalable unified graphics and parallel computing architecture. Its scalable parallel array of processors is massively multithreaded and programmable in C or via graphics APIs.

Nov, 5

### A translation system for enabling data mining applications on GPUs

Modern GPUs offer much computing power at a very modest cost. Even though CUDA and other related recent developments are accelerating the use of GPUs for general purpose applications, several challenges still remain in programming the GPUs. Thus, it is clearly desirable to be able to program GPUs using a higher-level interface. In this paper, […]

Nov, 5

### A Task Parallel Algorithm for Computing the Costs of All-Pairs Shortest Paths on the CUDA-Compatible GPU

This paper proposes a fast method for computing the costs of all-pairs shortest paths (APSPs) on the graphics processing unit (GPU). The proposed method is implemented using compute unified device architecture (CUDA), which offers us a development environment for performing general-purpose computation on the GPU. Our method is based on Harish’s iterative algorithm that computes […]

Nov, 5

### Lattice SU(2) on GPU’s

We discuss the CUDA approach to the simulation of pure gauge Lattice SU(2). CUDA is a hardware and software architecture developed by NVIDIA for computing on the GPU. We present an analysis and performance comparison between the GPU and CPU with single precision. Analyses with single and multiple GPUs, using CUDA and OpenMP, are also […]

Nov, 5

### ECM on Graphics Cards

This paper reports record-setting performance for the elliptic-curve method of integer factorization: for example, 926.11 curves/second for ECM stage 1 with B1=8192 for 280-bit integers on a single PC. The state-of-the-art GMP-ECM software handles 124.71 curves/second for ECM stage 1 with B1=8192 for 280-bit integers using all four cores of a 2.4 GHz Core 2 Quad […]

Nov, 5

### Realistic real-time sound re-synthesis and processing for interactive virtual worlds

We present new GPU-based techniques for implementing linear digital filters for real-time audio processing. Our solution for recursive filters is the first presented in the literature. We demonstrate the relevance of these algorithms to computer graphics by synthesizing realistic sounds of colliding objects made of different materials, such as glass, plastic, and wood, in real […]

Nov, 5

### Solving Path Problems on the GPU

We consider the computation of shortest paths on Graphics Processing Units (GPUs). The blocked recursive elimination strategy we use is applicable to a class of algorithms (such as all-pairs shortest-paths, transitive closure, and LU decomposition without pivoting) having similar data access patterns. Using the all-pairs shortest-paths problem as an example, we uncover potential gains over […]