high performance computing on graphics processing units: hgpu.org

Posts

Sep, 19

A small-world network model for distributed storage of semantic metadata

The growing uptake of semantic web and grid ideas is raising the importance of optimising distribution algorithms for semantic metadata. While it is not yet clear how real-world metadata distribution patterns ought to evolve, practical experience of social and technical networks suggests that a small-world pattern is desirable and practical. We explore simulated small-world networks […]

CUDA

Sep, 19

Using many-core hardware to correlate radio astronomy signals

A recent development in radio astronomy is to replace traditional dishes with many small antennas. The signals are combined to form one large, virtual telescope. The enormous data streams are cross-correlated to filter out noise. This is especially challenging, since the computational demands grow quadratically with the number of data streams. Moreover, the correlator is […]

CUDA

Sep, 19

An adaptative game loop architecture with automatic distribution of tasks between CPU and GPU

This article presents a new architecture to implement all game loop models for games and real-time applications that use the GPU as a mathematics and physics coprocessor, working in parallel processing mode with the CPU. The presented model applies automatic task distribution concepts. The architecture can apply a set of heuristics defined in Lua scripts […]

CUDA

Sep, 19

cuIBM — A GPU-accelerated Immersed Boundary Method

A projection-based immersed boundary method is dominated by sparse linear algebra routines. Using the open-source Cusp library, we observe a speedup (with respect to a single CPU core) which reflects the constraints of a bandwidth-dominated problem on the GPU. Nevertheless, GPUs offer the capacity to solve large problems on commodity hardware. This work includes validation […]

CUDA

Sep, 17

GPU Technology Conference, GTC 2012

GTC advances awareness of high performance computing, and connects the scientists, engineers, researchers, and developers who use GPUs to tackle enormous computational challenges. GTC 2012 will feature the latest breakthroughs and the most amazing content in GPU-enabled computing. Spanning 4 full days of world-class education delivered by some of the greatest minds in GPU computing, […]

Sep, 17

39th International Symposium on Computer Architecture, ISCA 2012

The International Symposium on Computer Architecture is the premier forum for new ideas and experimental results in computer architecture. Novel papers are solicited on a broad range of topics, including (but not limited to): * Processor, memory, and storage systems architecture * Parallel and multi-core systems * Interconnection networks * Instruction, thread, and data-level parallelism […]

Sep, 17

26th IEEE International Parallel & Distributed Processing Symposium, IPDPS 2012

IPDPS is an international forum for engineers and scientists from around the world to present their latest research findings in all aspects of parallel computation. In addition to technical sessions of submitted paper presentations, the meeting offers workshops, tutorials, and commercial presentations & exhibits. IPDPS represents a unique international gathering of computer scientists from around […]

Sep, 16

Returning control to the programmer: SIMD intrinsics for virtual machines

Exposing SIMD units within interpreted languages could simplify programs and unleash floods of untapped processor power. Server and workstation hardware architecture is continually improving, yet interpreted languages, most importantly Java, have failed to keep pace with the proper utilization of modern processors. SIMD (single instruction, multiple data) units are available in nearly every current desktop and server […]

Sep, 16

NIMBLE: a toolkit for the implementation of parallel data mining and machine learning algorithms on mapreduce

In the last decade, advances in data collection and storage technologies have led to an increased interest in designing and implementing large-scale parallel algorithms for machine learning and data mining (ML-DM). Existing programming paradigms for expressing large-scale parallelism such as MapReduce (MR) and the Message Passing Interface (MPI) have been the de facto choices for […]

OpenCL

Sep, 16

Programmable and Scalable Architecture for Graphics Processing Units

Graphics processing is an application area with a high level of parallelism at the data level and at the task level. Therefore, graphics processing units (GPUs) are often implemented as multiprocessing systems with high performance floating point processing and application specific hardware stages for maximizing the graphics throughput. In this paper we evaluate the suitability of […]

OpenCL • OpenGL

Sep, 16

Searching for Concurrent Design Patterns in Video Games

The transition to multicore architectures has dramatically underscored the necessity for parallelism in software. In particular, while new gaming consoles are by and large multicore, most existing video game engines are essentially sequential and thus cannot easily take advantage of this hardware. In this paper we describe techniques derived from our experience parallelizing an open-source […]

Sep, 16

A Light-Weight Approach to Dynamical Runtime Linking Supporting Heterogenous, Parallel, and Reconfigurable Architectures

When targeting hardware accelerators and reconfigurable processing units, the question of programmability arises, i.e. how different implementations of individual, configuration-specific functions are provided. Conventionally, this is resolved either at compilation time with a specific hardware environment being targeted, by initialization routines at program start, or decision trees at run-time. Such techniques are, however, hardly applicable […]

Recent source codes

MSCCL++: A GPU-driven communication stack for scalable AI applications

Benchmark compute shader of Unity against InteropUnityCUDA

Data-efficient LLM Fine-tuning for Code Generation

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

Large Language Model Powered C-to-CUDA Code Translation: A Novel Auto-Parallelization Framework

GigaAPI: a user-space API that simplifies multi-GPU programming, bridging the gap between the capabilities of parallel GPU systems and the ability of developers to harness their full potential

Coccinelle: a C code transformation engine using SmPL for matches, refactorings, and bug fixing

DuoReduce: MLIR's benchmark

Shamrock: Multi-GPU hydrodynamics for astrophysics

LLMPerf: GPU Performance Modeling meets Large Language Models
