high performance computing on graphics processing units: hgpu.org

Posts

Jun, 22

Concurrent Task Execution on the Intel Xeon Phi

The Intel Xeon Phi coprocessor is a new option for the high performance computing industry, and it needs to be tested. In this thesis, we compared the performance of the Xeon Phi and the GPU. The Smith-Waterman algorithm is a widely used algorithm for solving the sequence alignment problem. We implemented two versions […]

CUDA

Jun, 22

Comparative Study of the Parallelization of the Smith-Waterman Algorithm on OpenMP and Cuda C

In this paper, we present parallel programming approaches to calculate the values of the cells in the scoring matrix used in the Smith-Waterman algorithm for sequence alignment. This algorithm, well known in bioinformatics for its applications, is unfortunately time-consuming on a serial computer. We use a formulation based on the anti-diagonal structure of the data. This representation focuses on […]

CUDA
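
As a rough illustration of the anti-diagonal idea mentioned in the abstract above, the sketch below fills a Smith-Waterman scoring matrix one anti-diagonal at a time. It is not the paper's OpenMP or CUDA code: the function name and the scoring values (`match=2`, `mismatch=-1`, linear `gap=-1`) are assumptions chosen for the example, and the inner loop runs serially here.

```python
def smith_waterman_antidiagonal(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score, filling the matrix by anti-diagonals.

    Scoring parameters are illustrative assumptions, not taken from the paper.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # row 0 and column 0 stay at 0
    best = 0
    # Cells with i + j == d form one anti-diagonal. Each depends only on
    # cells from diagonals d-1 and d-2, so cells on the same diagonal are
    # independent of one another.
    for d in range(2, rows + cols - 1):
        i_lo = max(1, d - (cols - 1))
        i_hi = min(rows - 1, d - 1)
        for i in range(i_lo, i_hi + 1):  # candidate loop for parallel execution
            j = d - i
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,  # match or mismatch
                          H[i - 1][j] + gap,    # gap in b
                          H[i][j - 1] + gap)    # gap in a
            best = max(best, H[i][j])
    return best


if __name__ == "__main__":
    print(smith_waterman_antidiagonal("ACGT", "ACGT"))
```

Because every cell on a diagonal reads only from earlier diagonals, the inner loop is the natural place to distribute work across threads, with a synchronization point between consecutive diagonals.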

Jun, 22

Optimization and Parallelization Methods for the Design of Next-Generation Radio Networks

The complexity of the design of radio networks has grown with the adoption of modern standards. Therefore, the role of the computer for the faster delivery of accurate results has become increasingly important. In this thesis, novel methods for the planning and automatic optimization of radio networks are developed and discussed. The state-of-the-art metaheuristic algorithms, […]

OpenCL

Jun, 22

GPU accelerated spectral finite elements on all-hex meshes

This paper presents a spectral element finite element scheme that efficiently solves elliptic problems on unstructured hexahedral meshes. The discrete equations are solved using a matrix-free preconditioned conjugate gradient algorithm. An additive Schwarz two-scale preconditioner is employed that allows h-independent convergence. An extensible multi-threading programming API is used as a common kernel language that allows […]

CUDA • OpenCL

Jun, 22

Generating Efficient Tensor Contractions for GPUs

Many scientific and numerical applications, including quantum chemistry modeling and fluid dynamics simulation, require tensor product and tensor contraction evaluation. Tensor computations are characterized by arrays with numerous dimensions, inherent parallelism, moderate data reuse and many degrees of freedom in the order in which to perform the computation. The best-performing implementation is heavily dependent on […]

CUDA

Jun, 19

Autotuning Tensor Contraction Computations on GPUs

We describe a framework for generating optimized GPU code for computing tensor contractions, a multidimensional generalization of matrix-matrix multiplication that arises frequently in computational science applications. Typical performance optimization strategies for such computations transform the tensors into sequences of matrix-matrix multiplications to take advantage of an optimized BLAS library, but this approach is not appropriate […]

CUDA

Jun, 19

Study of Sparse-Matrix Vector Multiplication (SpMV) on Different Architectures and Libraries

With the advent of parallel processing architectures and a steep increase in parallelism found among recent applications, GPGPUs have gained attention with respect to their importance in the execution of these applications. In this document, we specifically analyze Sparse-Matrix Vector Multiplication (SpMV) across different architectures, libraries and matrix formats. The experimental platforms include but are […]

CUDA • OpenCL

Jun, 19

Bulk GCD Computation Using a GPU to Break Weak RSA Keys

RSA is one of the most well-known public-key cryptosystems, widely used for secure data transfer. An RSA encryption key includes a modulus n which is the product of two large prime numbers p and q. If an RSA modulus n can be decomposed into p and q, the corresponding decryption key can be computed easily from […]

CUDA
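
To make the weakness described in the abstract above concrete, here is a minimal CPU-only sketch: if two RSA moduli share a prime, their GCD reveals it, and the private exponent follows from p and q. The pairwise loop, the function names, and the toy key values are assumptions for illustration; the abstract is truncated, so this does not reproduce the paper's GPU algorithm.

```python
from math import gcd


def shared_prime_factors(moduli):
    """Map modulus index -> (p, q) for every modulus sharing a prime with another.

    Naive pairwise scan (hypothetical helper, quadratic in the number of keys).
    """
    found = {}
    for i in range(len(moduli)):
        for j in range(i + 1, len(moduli)):
            g = gcd(moduli[i], moduli[j])
            # g == 1 means no shared prime; g == n means identical moduli,
            # which yields no factorization.
            if g > 1 and g != moduli[i] and g != moduli[j]:
                found[i] = (g, moduli[i] // g)
                found[j] = (g, moduli[j] // g)
    return found


def private_exponent(p, q, e):
    """Recover d with e * d == 1 (mod (p - 1) * (q - 1)). Needs Python 3.8+."""
    return pow(e, -1, (p - 1) * (q - 1))


if __name__ == "__main__":
    # Toy moduli, far too small for real use: 3233 = 61 * 53, 3599 = 61 * 59,
    # 4757 = 67 * 71. The first two share the prime 61.
    factors = shared_prime_factors([3233, 3599, 4757])
    print(factors)
    p, q = factors[0]
    print(private_exponent(p, q, 17))
```

With real key sizes the number of pairs grows quadratically with the number of collected moduli, which is the kind of workload that motivates running many GCD computations in bulk.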

Jun, 19

Parallel BTF Compression with Multi-Level Vector Quantization in OpenCL

Bidirectional Texture Function (BTF) as an effective visual fidelity representation of surface appearance is becoming more and more widely used. In this paper we report on contributions to BTF data compression for multi-level vector quantization. We describe novel decompositions that improve the compression ratio by 15% in comparison with the original method, without loss of […]

OpenCL

Jun, 19

Accelerated dimension-independent adaptive Metropolis

This work considers black-box Bayesian inference over high-dimensional parameter spaces. The well-known adaptive Metropolis (AM) algorithm of Haario et al. (2001) is extended herein to scale asymptotically uniformly with respect to the underlying parameter dimension for Gaussian targets, by respecting the variance of the target. The resulting algorithm, referred to as the dimension-independent adaptive Metropolis (DIAM) […]

CUDA

Jun, 19

Visualization of OpenCL Application Execution on CPU-GPU Systems

Evaluating the performance of parallel and heterogeneous programs and architectures can be challenging. An emulator or simulator can be used to aid the programmer. To provide guidance and feedback to the programmer, the simulator needs to present traces, reports, and debugging information in a coherent and unambiguous format. Although these outputs contain a lot of […]

CUDA • OpenCL

Jun, 17

2nd International Conference on Mechanical, Aeronautical and Automotive Engineering (ICMAA), 2015

Topics: Mechanical Engineering, Applied Mechanics, Automation, Biomechanics, Computational Fluid Dynamics, Design and Manufacturing, Energy Management, Fluid Dynamics, Fuels and Combustion, Green Manufacturing, Heat and Mass Transfer, Industrial Tribology, Instrumentation and Control, Internal Combustion Engines, Mechatronics, Micro-Machining, Modeling of Processes, Nanotechnology, Optimization of Systems, Renewable and Non-Renewable Energies, Reverse Engineering, Robotics, Solid Mechanics, Oil and […]
