high performance computing on graphics processing units: hgpu.org

Posts

Dec, 15

A Survey Of Techniques for Cache Locking

Cache memory, although important for boosting application performance, is also a source of execution time variability, and this makes its use difficult in systems requiring worst case execution time (WCET) guarantees. Cache locking is a promising approach for simplifying WCET estimation and providing predictability; hence, several commercial processors provide the ability to lock the cache. However, […]

Dec, 14

Free-form interest rate term structure decomposition: a 2nd order optimization problem

The paper discusses an interest rate term structure decomposition method that breaks from the conventional, in that it does not superimpose any model, form or structure on the decomposition output – hence, the term free-form. The premise is simple: if the model does not presuppose any structure beforehand, and if the structure underlying the input […]

Dec, 12

A Scalable Lane Detection Algorithm on COTSs with OpenCL

Road lane detection is a classical requirement for advanced driving assistant systems. With new computer technologies, lane detection algorithms can be exploited on COTS platforms. This paper investigates the use of OpenCL and develops a particle-filter based lane detection algorithm that can tune the trade-off between detection accuracy and speed. Our algorithm is tested on 14 […]

OpenCL

Dec, 12

Behavioral Non-portability in Scientific Numeric Computing

The precise semantics of floating-point arithmetic programs depends on the execution platform, including the compiler and the target hardware. Platform dependencies are particularly pronounced for arithmetic-intensive parallel numeric programs and infringe on the highly desirable goal of software portability (which is nonetheless promised by heterogeneous computing frameworks like OpenCL): the same program run on the […]

OpenCL

Dec, 12

Large-Scale Compute-Intensive Analysis via a Combined In-Situ and Co-Scheduling Workflow Approach

Large-scale simulations can produce hundreds of terabytes to petabytes of data, complicating and limiting the efficiency of workflows. Traditionally, outputs are stored on the file system and analyzed in post-processing. With the rapidly increasing size and complexity of simulations, this approach faces an uncertain future. Trending techniques consist of performing the analysis in-situ, utilizing the […]

OpenCL

Dec, 12

Accelerating Exact Similarity Search on CPU-GPU Systems

In recent years, the use of Graphics Processing Units (GPUs) for data mining tasks has become popular. With modern processors integrating both CPUs and GPUs, it is also important to consider what tasks benefit from GPU processing and which do not, and apply a heterogeneous processing approach to improve the efficiency where applicable. Similarity search, […]

OpenCL

Dec, 12

GRATER: An Approximation Workflow for Exploiting Data-Level Parallelism in FPGA Acceleration

Modern applications including graphics, multimedia, web search, and data analytics not only can benefit from acceleration, but also exhibit significant degrees of tolerance to imprecise computation. This amenability to approximation provides an opportunity to trade quality of the results for higher performance and better resource utilization. Exploiting this opportunity is particularly important for FPGA accelerators […]

OpenCL

Dec, 10

MapGraph: A High Level API for Fast Development of High Performance Graph Analytics on GPUs

High performance graph analytics are critical for a long list of application domains. In recent years, the rapid advancement of many-core processors, in particular graphical processing units (GPUs), has sparked a broad interest in developing high performance parallel graph programs on these architectures. However, the SIMT architecture used in GPUs places particular constraints on both […]

CUDA

Dec, 10

Transforming C OpenMP Programs for Verification in CIVL

There are numerous ways to express parallelism, which can make it challenging for developers to verify these programs. Many tools only target a single dialect, but the Concurrency Intermediate Verification Language (CIVL) targets MPI, Pthreads, and CUDA. CIVL provides a general concurrency model that can represent programs in a variety of concurrency dialects. CIVL […]

CUDA • OpenCL

Dec, 10

Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

We show that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech, two vastly different languages. Because it replaces entire pipelines of hand-engineered components with neural networks, end-to-end learning allows us to handle a diverse variety of speech including noisy environments, accents and different languages. Key to our approach […]

CUDA

Dec, 10

High Performance Histograms on SIMT and SIMD Architectures

Using the histogram procedure, this work studies performance determining factors in computing in parallel on SIMD and SIMT devices. Modern graphics processing units (GPUs) support SIMT, multiple threads running the same instruction, whereas central processing units (CPUs) use SIMD, in which one instruction operates on multiple operands. As part of this work, a cross-technology framework […]

CUDA • OpenCL

Dec, 10

Join Execution Using Fragmented Columnar Indices on GPU and MIC

The paper describes an approach to parallel natural join execution on computing clusters with GPU and MIC coprocessors. This approach is based on a decomposition of the natural join relational operator using column indices and domain-interval fragmentation. This decomposition admits parallel execution of resource-intensive relational operators without data transfers. All column index fragments are […]

CUDA

