high performance computing on graphics processing units: hgpu.org

Posts

Sep, 19

Simpler and faster HLBVH with work queues

A recently developed algorithm called Hierarchical Linear Bounding Volume Hierarchies (HLBVH) has demonstrated the feasibility of reconstructing the spatial index needed for ray tracing in real-time, even in the presence of millions of fully dynamic triangles. In this work we present a simpler and faster variant of HLBVH, where all the complex book-keeping of prefix […]

CUDA

Sep, 19

A GPU-tailored approach for training kernelized SVMs

We present a method for efficiently training binary and multiclass kernelized SVMs on a Graphics Processing Unit (GPU). Our methods apply to a broad range of kernels, including the popular Gaussian kernel, on datasets as large as the amount of available memory on the graphics card. Our approach is distinguished from earlier work in […]

Sep, 19

Stream computing on graphics hardware

The raw compute performance of today’s graphics processor is truly amazing. With peak performance of over 60 GFLOPS, the compute power of the graphics processor (GPU) dwarfs that of today’s commodity CPU at a price of only a few hundred dollars. As the programmability and performance of modern graphics hardware continue to increase, many researchers […]

Sep, 19

A small-world network model for distributed storage of semantic metadata

The growing uptake of semantic web and grid ideas is raising the importance of optimising distribution algorithms for semantic metadata. While it is not yet clear how real-world metadata distribution patterns ought to evolve, practical experience of social and technical networks suggests that a small-world pattern is desirable and practical. We explore simulated small-world networks […]

CUDA

Sep, 19

Using many-core hardware to correlate radio astronomy signals

A recent development in radio astronomy is to replace traditional dishes with many small antennas. The signals are combined to form one large, virtual telescope. The enormous data streams are cross-correlated to filter out noise. This is especially challenging, since the computational demands grow quadratically with the number of data streams. Moreover, the correlator is […]

CUDA

Sep, 19

An adaptative game loop architecture with automatic distribution of tasks between CPU and GPU

This article presents a new architecture to implement all game loop models for games and real-time applications that use the GPU as a mathematics and physics coprocessor, working in parallel processing mode with the CPU. The presented model applies automatic task distribution concepts. The architecture can apply a set of heuristics defined in Lua scripts […]

CUDA

Sep, 19

cuIBM — A GPU-accelerated Immersed Boundary Method

A projection-based immersed boundary method is dominated by sparse linear algebra routines. Using the open-source Cusp library, we observe a speedup (with respect to a single CPU core) which reflects the constraints of a bandwidth-dominated problem on the GPU. Nevertheless, GPUs offer the capacity to solve large problems on commodity hardware. This work includes validation […]

CUDA

Sep, 17

GPU Technology Conference, GTC 2012

GTC advances awareness of high performance computing, and connects the scientists, engineers, researchers, and developers who use GPUs to tackle enormous computational challenges. GTC 2012 will feature the latest breakthroughs and the most amazing content in GPU-enabled computing. Spanning 4 full days of world-class education delivered by some of the greatest minds in GPU computing, […]

Sep, 17

39th International Symposium on Computer Architecture, ISCA 2012

The International Symposium on Computer Architecture is the premier forum for new ideas and experimental results in computer architecture. Novel papers are solicited on a broad range of topics, including (but not limited to): * Processor, memory, and storage systems architecture * Parallel and multi-core systems * Interconnection networks * Instruction, thread, and data-level parallelism […]

Sep, 17

26th IEEE International Parallel & Distributed Processing Symposium, IPDPS 2012

IPDPS is an international forum for engineers and scientists from around the world to present their latest research findings in all aspects of parallel computation. In addition to technical sessions of submitted paper presentations, the meeting offers workshops, tutorials, and commercial presentations & exhibits. IPDPS represents a unique international gathering of computer scientists from around […]

Sep, 16

Returning control to the programmer: SIMD intrinsics for virtual machines

Exposing SIMD units within interpreted languages could simplify programs and unleash floods of untapped processor power. Server and workstation hardware architecture is continually improving, yet interpreted languages, most importantly Java, have failed to keep pace with the proper utilization of modern processors. SIMD (single instruction, multiple data) units are available in nearly every current desktop and server […]

Sep, 16

NIMBLE: a toolkit for the implementation of parallel data mining and machine learning algorithms on mapreduce

In the last decade, advances in data collection and storage technologies have led to an increased interest in designing and implementing large-scale parallel algorithms for machine learning and data mining (ML-DM). Existing programming paradigms for expressing large-scale parallelism such as MapReduce (MR) and the Message Passing Interface (MPI) have been the de facto choices for […]

OpenCL

* * *

Recent source codes

SYCL Container

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

PELSI: Power-Efficient Layer-Switched Inference

Ouroboros: Virtualized Queues for dynamic memory management

MSCCL++: A GPU-driven communication stack for scalable AI applications

Benchmark compute shader of Unity against InteropUnityCUDA
