Posts

Jan, 26

Adjoint Lattice Boltzmann for Topology Optimization on multi-GPU architecture

In this paper we present a topology optimization technique applicable to a broad range of flow design problems. We also propose a discrete adjoint formulation effective for a wide class of Lattice Boltzmann Methods (LBM). This adjoint formulation is used to calculate the sensitivity of the LBM solution to several types of parameters, both global and […]
Jan, 26

A High Performance Framework for Coupled Urban Microclimate Models

Urban form modifies the microclimate and may trap heat and pollutants. This raises the energy demand for heating and cooling building interiors. Mitigating these effects is a growing concern due to the increasing urbanization of major cities. Researchers, urban planners, and city architects rely on sophisticated simulations to investigate how to reduce […]
Jan, 26

Tangram: a High-level Language for Performance Portable Code Synthesis

We propose Tangram, a general-purpose high-level language that achieves high performance across architectures. In Tangram, a program is written by synthesizing elemental pieces of code snippets, called codelets. A codelet can have multiple semantic-preserving implementations to enable automated algorithm and implementation selection. An implementation of a codelet can be written with tunable knobs to allow […]
Jan, 26

GPU computing architecture for irregular parallelism

Many applications with regular parallelism have been shown to benefit from using Graphics Processing Units (GPUs). However, employing GPUs for applications with irregular parallelism tends to be a risky process, involving significant effort from the programmer and an uncertain amount of performance/efficiency benefit. One known challenge in developing GPU applications with irregular parallelism is the […]
Jan, 26

Performance Analysis of Join Algorithms on GPUs

Implementing database operations on parallel platforms has gained a lot of momentum in the past decade, due to the increasing popularity of many-core processors. A number of studies have shown the potential of using GPUs to speed up database operations. In this paper, we present empirical evaluations of a state-of-the-art work published in SIGMOD’08 on […]
Jan, 23

Real-time physically cloth simulation with CUDA

With the development of simulation techniques, deformable cloth simulation is in high demand. It can be widely used in many fields such as games, animation, and virtual surgery. Achieving real-time performance is the most urgent bottleneck to be solved. This paper introduces a solution to implement deformable simulation of cloth in real […]
Jan, 23

Revisit Long Short-Term Memory: An Optimization Perspective

Long Short-Term Memory (LSTM) is a deep recurrent neural network architecture with high computational complexity. Contrary to the standard practice of training LSTM online with stochastic gradient descent (SGD) methods, we propose a matrix-based batch learning method for LSTM with full Backpropagation Through Time (BPTT). We further solve the state drifting issues as well as […]
Jan, 23

Taming the complexities of the C11 and OpenCL memory models

We study how the C11 memory model can be simplified and how it can be extended. Our first contribution is to propose a mild strengthening of the model that enables the rules pertaining to sequentially-consistent (SC) operations to be significantly simplified. We eliminate one of the total orders that candidate executions must range over, leading […]
Jan, 23

Can Portability Improve Performance? An Empirical Study of Parallel Graph Analytics

Due to increasingly large datasets, graph analytics – traversals, all-pairs shortest path computations, centrality measures, etc. – are becoming the focus of high-performance computing (HPC). Because HPC is currently dominated by many-core architectures (both CPUs and GPUs), new graph processing solutions have to be defined to efficiently use such computing resources. Prior work focuses on […]
Jan, 23

Gunrock: A High-Performance Graph Processing Library on the GPU

For large-scale graph analytics on the GPU, the irregularity of data access and control flow and the complexity of programming GPUs have been two significant challenges for developing a programmable high-performance graph library. "Gunrock", our graph-processing system, uses a high-level bulk-synchronous abstraction with traversal and computation steps, designed specifically for the GPU. Gunrock couples high […]
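
As a rough illustration of a bulk-synchronous pattern that alternates traversal and computation steps, the sketch below runs a breadth-first search one frontier at a time on the CPU. It is not Gunrock's API: the function names `advance` and `bfs_depths`, the adjacency-list dictionary, and the sample graph are all hypothetical and chosen only for this example.

```python
def advance(graph, frontier, visited):
    """Traversal step: expand every frontier vertex to its unvisited neighbors."""
    next_frontier = []
    for u in frontier:
        for v in graph[u]:
            if v not in visited:
                visited.add(v)
                next_frontier.append(v)
    return next_frontier


def bfs_depths(graph, source):
    """Alternate traversal and computation steps until the frontier is empty."""
    depth = {source: 0}
    visited = {source}
    frontier = [source]
    level = 0
    while frontier:
        level += 1
        # Traversal step: build the next frontier from the current one.
        frontier = advance(graph, frontier, visited)
        # Computation step: record a per-vertex result for the new frontier.
        for v in frontier:
            depth[v] = level
    return depth


# Hypothetical directed graph as an adjacency list; vertex 4 is unreachable from 0.
graph = {0: [1, 2], 1: [3], 2: [3], 3: [], 4: [0]}
depths = bfs_depths(graph, 0)
print(depths)
```

Each loop iteration acts as one synchronization point: the whole frontier is expanded before any vertex of the next level is processed. A GPU implementation would run each step in parallel over the frontier, which this sequential sketch does not attempt to show.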
Jan, 21

Reproducible and Accurate Matrix Multiplication for GPU Accelerators

Due to non-associativity of floating-point operations and dynamic scheduling on parallel architectures, getting a bitwise reproducible floating-point result for multiple executions of the same code on different or even similar parallel architectures is challenging. In this paper, we address the problem of reproducibility in the context of matrix multiplication and propose an algorithm that yields […]
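
The non-associativity mentioned above can be seen in a minimal sketch that sums the same four double-precision values in two different orders. The helper `left_to_right_sum` and the sample values are assumptions made for this illustration and do not come from the paper. `math.fsum` from the Python standard library is shown only as one order-independent reference and is not the algorithm the paper proposes.

```python
import math


def left_to_right_sum(values):
    """Accumulate in the given order, rounding after every addition."""
    total = 0.0
    for v in values:
        total += v
    return total


# Same multiset of values, two different summation orders.
values = [1e16, 1.0, -1e16, 1.0]
reordered = [1e16, -1e16, 1.0, 1.0]

a = left_to_right_sum(values)      # the first 1.0 is lost when added to 1e16
b = left_to_right_sum(reordered)   # the large terms cancel first, nothing is lost

# math.fsum tracks the lost low-order bits, so its result ignores the order.
c = math.fsum(values)
d = math.fsum(reordered)

print(a, b, c, d)
```

On a parallel machine the order of accumulation can change from run to run, so the same input may produce either of the first two results unless the algorithm fixes the order or compensates for rounding.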
Jan, 21

GPU concurrency: Weak behaviours and programming assumptions

Concurrency is pervasive and perplexing, particularly on graphics processing units (GPUs). Current specifications of languages and hardware are inconclusive; thus programmers often rely on folklore assumptions when writing software. To remedy this state of affairs, we conducted a large empirical study of the concurrent behaviour of deployed GPUs. Armed with litmus tests (i.e. short concurrent […]

* * *

HGPU group © 2010-2025 hgpu.org

All rights belong to the respective authors

Contact us:

contact@hpgu.org