13395

Posts

Jan, 26

Tangram: a High-level Language for Performance Portable Code Synthesis

We propose Tangram, a general-purpose high-level language that achieves high performance across architectures. In Tangram, a program is written by synthesizing elemental pieces of code snippets, called codelets. A codelet can have multiple semantic-preserving implementations to enable automated algorithm and implementation selection. An implementation of a codelet can be written with tunable knobs to allow […]
Jan, 26

GPU computing architecture for irregular parallelism

Many applications with regular parallelism have been shown to benefit from using Graphics Processing Units (GPUs). However, employing GPUs for applications with irregular parallelism tends to be a risky process, involving significant effort from the programmer and an uncertain amount of performance/efficiency benefit. One known challenge in developing GPU applications with irregular parallelism is the […]
Jan, 26

Performance Analysis of Join Algorithms on GPUs

Implementing database operations on parallel platforms has gain a lot of momentum in the past decade, due to the increasing popularity of many-core processors. A number of studies have shown the potential of using GPUs to speed up database operations. In this paper, we present empirical evaluations of a state-of-the-art work published in SIGMOD’08 on […]
Jan, 23

Real-time physically cloth simulation with CUDA

With the development of the simulation technique, deformable cloth simulation has become highly desired. It can be widely used in many fields such as game, animation, virtual surgery, etc. Real-time algorithm is the most urgent bottleneck problem that needs to be solved. This paper introduces a solution to implement deformable simulation of cloth in real […]
Jan, 23

Revisit Long Short-Term Memory: An Optimization Perspective

Long Short-Term Memory (LSTM) is a deep recurrent neural network architecture with high computational complexity. Contrary to the standard practice to train LSTM online with stochastic gradient descent (SGD) methods, we propose a matrix-based batch learning method for LSTM with full Backpropagation Through Time (BPTT). We further solve the state drifting issues as well as […]
Jan, 23

Taming the complexities of the C11 and OpenCL memory models

We study how the C11 memory model can be simplified and how it can be extended. Our first contribution is to propose a mild strengthening of the model that enables the rules pertaining to sequentially-consistent (SC) operations to be significantly simplified. We eliminate one of the total orders that candidate executions must range over, leading […]
Jan, 23

Can Portability Improve Performance? An Empirical Study of Parallel Graph Analytics

Due to increasingly large datasets, graph analytics – traversals, all-pairs shortest path computations, centrality measures, etc. – are becoming the focus of high-performance computing (HPC). Because HPC is currently dominated by many-core architectures (both CPUs and GPUs), new graph processing solutions have to be defined to efficiently use such computing resources. Prior work focuses on […]
Jan, 23

Gunrock: A High-Performance Graph Processing Library on the GPU

For large-scale graph analytics on the GPU, the irregularity of data access and control flow and the complexity of programming GPUs have been two significant challenges for developing a programmable high-performance graph library. "Gunrock", our graph-processing system, uses a high-level bulk-synchronous abstraction with traversal and computation steps, designed specifically for the GPU. Gunrock couples high […]
Jan, 21

Reproducible and Accurate Matrix Multiplication for GPU Accelerators

Due to non-associativity of floating-point operations and dynamic scheduling on parallel architectures, getting a bitwise reproducible floating-point result for multiple executions of the same code on different or even similar parallel architectures is challenging. In this paper, we address the problem of reproducibility in the context of matrix multiplication and propose an algorithm that yields […]
Jan, 21

GPU concurrency: Weak behaviours and programming assumptions

Concurrency is pervasive and perplexing, particularly on graphics processing units (GPUs). Current specifications of languages and hardware are inconclusive; thus programmers often rely on folklore assumptions when writing software. To remedy this state of affairs, we conducted a large empirical study of the concurrent behaviour of deployed GPUs. Armed with litmus tests (i.e. short concurrent […]
Jan, 21

A Novel Implementation of QuickHull Algorithm on the GPU

We present a novel GPU-accelerated implementation of the QuickHull algorithm for calculating convex hulls of planar point sets. We also describe a practical solution to demonstrate how to efficiently implement a typical Divide-and-Conquer algorithm on the GPU. We highly utilize the parallel primitives provided by the library Thrust such as the parallel segmented scan for […]
Jan, 21

Opportunities for Nonvolatile Memory Systems in Extreme-Scale High Performance Computing

For extreme-scale high performance computing systems, system-wide power consumption has been identified as one of the key constraints moving forward, where the DRAM main memory systems account for about 30-50% of a node's overall power consumption. Moreover, as the benefits of device scaling for DRAM memory slow, it will become increasingly difficult to keep memory […]
Page 2 of 78212345...102030...Last »

* * *

* * *

* * *

Free GPU computing nodes at hgpu.org

Registered users can now run their OpenCL application at hgpu.org. We provide 1 minute of computer time per each run on two nodes with two AMD and one nVidia graphics processing units, correspondingly. There are no restrictions on the number of starts.

The platforms are

Node 1
  • GPU device 0: nVidia GeForce GTX 560 Ti 2GB, 822MHz
  • GPU device 1: AMD/ATI Radeon HD 6970 2GB, 880MHz
  • CPU: AMD Phenom II X6 @ 2.8GHz 1055T
  • RAM: 12GB
  • OS: OpenSUSE 13.1
  • SDK: nVidia CUDA Toolkit 6.5.14, AMD APP SDK 3.0
Node 2
  • GPU device 0: AMD/ATI Radeon HD 7970 3GB, 1000MHz
  • GPU device 1: AMD/ATI Radeon HD 5870 2GB, 850MHz
  • CPU: Intel Core i7-2600 @ 3.4GHz
  • RAM: 16GB
  • OS: OpenSUSE 12.2
  • SDK: AMD APP SDK 2.9

Completed OpenCL project should be uploaded via User dashboard (see instructions and example there), compilation and execution terminal output logs will be provided to the user.

The information send to hgpu.org will be treated according to our Privacy Policy

HGPU group © 2010-2015 hgpu.org

All rights belong to the respective authors

Contact us: