high performance computing on graphics processing units: hgpu.org

Posts

Nov, 13

Accelerating Recommender Systems using GPUs

We describe GPU implementations of the matrix recommender algorithms CCD++ and ALS. We compare the processing time and predictive ability of the GPU implementations with existing multi-core versions of the same algorithms. Results on the GPU are better than the results of the multi-core versions (maximum speedup of 14.8).

CUDA

Nov, 13

A Survey Of Techniques for Architecting and Managing Asymmetric Multicore Processors

To meet the needs of a diverse range of workloads, asymmetric multicore processors (AMPs) have been proposed, which feature cores of different microarchitectures or ISAs. However, given the diversity inherent in their design and application scenarios, several challenges need to be addressed to effectively architect AMPs and leverage their potential in optimizing both sequential and parallel […]

Nov, 12

FIESTA 4: optimized Feynman integral calculations with GPU support

This paper presents a new major release of the program FIESTA (Feynman Integral Evaluation by a Sector decomposiTion Approach). The new release is mainly aimed at optimal performance at large scales when one is increasing the number of sampling points in order to reduce the uncertainty estimates. The release now supports graphical processor units (GPU) […]

CUDA

Nov, 12

Microlensing Observations Rapid Search for Exoplanets: MORSE code for GPUs

The rapid analysis of ongoing gravitational microlensing events has been integral to the successful detection and characterisation of cool planets orbiting low mass stars in the Galaxy. In this paper we present an implementation of search and fit techniques on Graphical Processing Unit hardware. The method allows for the rapid identification of candidate planetary microlensing […]

CUDA

Nov, 12

A polyphase filter for many-core architectures

In this article we discuss our implementation of a polyphase filter for real-time data processing in radio astronomy. We describe in detail our implementation of the polyphase filter algorithm and its behaviour on three generations of NVIDIA GPU cards, on dual Intel Xeon CPUs and the Intel Xeon Phi (Knights Corner) platforms. All of our […]

CUDA

Nov, 12

Assembly-Free Structural Dynamics On CPU and GPU

Finite Element Analysis helps designers at the early stages of product design through simulation and behavioral prediction. This thesis is on transient finite element analysis, specifically, structural dynamics, where the behavior of a product due to time-dependent loads is desired. A critical computational challenge in structural dynamics is that it typically requires significant amounts of […]

CUDA

Nov, 12

Comparison of parallel sorting algorithms

In our study we implemented and compared seven sequential and parallel sorting algorithms: bitonic sort, multistep bitonic sort, adaptive bitonic sort, merge sort, quicksort, radix sort and sample sort. Sequential algorithms were implemented on a central processing unit using C++, whereas parallel algorithms were implemented on a graphics processing unit using the CUDA platform. We chose […]

CUDA

Nov, 11

TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems

TensorFlow [1] is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets up to large-scale distributed systems of hundreds of machines […]

CUDA

Nov, 11

Autotuning OpenCL Workgroup Size for Stencil Patterns

Selecting an appropriate workgroup size is critical for the performance of OpenCL kernels, and requires knowledge of the underlying hardware, the data being operated on, and the implementation of the kernel. This makes portable performance of OpenCL programs a challenging goal, since simple heuristics and statically chosen values fail to exploit the available performance. To […]

OpenCL

Nov, 11

Climbing Mont Blanc – A Training Site for Energy Efficient Programming on Heterogeneous Multicore Processors

Climbing Mont Blanc (CMB) is an open online judge used for training in energy efficient programming of state-of-the-art heterogeneous multicores. It uses an Odroid-XU3 board from Hardkernel with an Exynos Octa processor and integrated power sensors. This processor is three-way heterogeneous containing 14 different cores of three different types. The board currently accepts C and […]

OpenCL

Nov, 11

Integrating a large-scale testing campaign in the CK framework

We consider the problem of conducting large experimental campaigns in computer science research. Most research efforts require a certain level of bookkeeping of results. This is manageable via quick, on-the-fly infrastructure implementations. However, it becomes a problem for large-scale testing initiatives, especially as the needs of the project evolve along the way. We look at […]

OpenCL

Nov, 11

Evaluating 3-D Stencil codes on Intel Xeon Phi: Limitations and Trade-offs

Accelerators like Intel Xeon Phi aim to fulfill the computational requirements of modern applications. Of particular interest to us are applications based on stencil computations. Stencils are finite-difference algorithms used in many scientific and engineering applications for solving large-scale and high-dimension partial differential equations. Programmability on massively parallel architectures of such kernels […]
