John Wickerson, Mark Batty
We study how the C11 memory model can be simplified and how it can be extended. Our first contribution is to propose a mild strengthening of the model that enables the rules pertaining to sequentially-consistent (SC) operations to be significantly simplified. We eliminate one of the total orders that candidate executions must range over, leading […]
Koji Nakano
The Hierarchical Memory Machine (HMM) is a theoretical parallel computing model that captures the essence of the architecture of CUDA-enabled GPUs. The main contribution of this paper is to present an efficient implementation of the O(n^3)-time dynamic programming algorithm for solving the optimal triangulation problem for a convex n-gon in the HMM. Although the HMM can […]
Sparsh Mittal, Jeffrey S. Vetter, Dong Li
Recent trends of CMOS scaling and an increasing number of on-chip cores have led to a large increase in the size of on-chip caches. Since SRAM has low density and consumes a large amount of leakage power, its use in designing on-chip caches has become more challenging. To address this issue, researchers are exploring the use of […]
Tyler Rey Sorensen
Graphics Processing Units (GPUs) are highly parallel shared memory microprocessors, and as such, they are prone to the same concurrency considerations as their traditional multicore CPU counterparts. In this thesis, we consider shared memory consistency, i.e. what values can be read when issued concurrently with writes on current GPU hardware. While memory consistency has been […]
Koji Nakano, Susumu Matsumae, Yasuaki Ito
The Discrete Memory Machine (DMM) is a theoretical parallel computing model that captures the essence of memory access to the shared memory of a streaming multiprocessor on CUDA-enabled GPUs. The DMM has w memory banks that constitute a shared memory, and w threads in a warp try to access them at the same time. However, […]
Bin Ren
SIMD accelerators and many-core coprocessors with coarse-grained and fine-grained parallelism are becoming increasingly popular. Streaming SIMD Extensions (SSE), Graphics Processing Units (GPUs), and Intel Xeon Phi (MIC) can provide orders of magnitude better performance and efficiency for parallel workloads as compared to single-core CPUs. However, parallelizing irregular applications involving dynamic data structures […]
John Cheng, Max Grossman, Ty McKercher
Designed for professionals across multiple industrial sectors, Professional CUDA C Programming presents CUDA — a parallel computing platform and programming model designed to ease the development of GPU programming — fundamentals in an easy-to-follow format, and teaches readers how to think in parallel and implement parallel algorithms on GPUs. Each chapter covers a specific topic, […]
Raphael Landaverde, Tiansheng Zhang, Ayse K. Coskun, Martin Herbordt
Managing memory between the CPU and GPU is a major challenge in GPU computing. A programming model, Unified Memory Access (UMA), has been recently introduced by Nvidia to simplify the complexities of memory management while claiming good overall performance. In this paper, we investigate this programming model and evaluate its performance and programming model simplifications […]
Vassilis Vassiliadis
The target of this thesis is to optimize memory management on heterogeneous systems. Our approach involves performing memory access pattern analysis on kernels in order to produce an accurate estimation of the memory usage. This information is produced in the form of array ranges describing which elements are accessed as well as whether they are […]
Bo Wu
This master's thesis focuses on several high-level parallel programming models for heterogeneous systems that have become increasingly popular in the field of high-performance computing. Heterogeneous systems are an inexpensive and effective way to achieve further performance improvements. A powerful combination of graphics processing units (GPUs) and central processing units (CPUs) is one of the most […]
Kazuya Tani, Daisuke Takafuji, Koji Nakano, Yasuaki Ito
The Unified Memory Machine (UMM) is a theoretical parallel computing model that captures the essence of the global memory access of GPUs. A sequential algorithm is oblivious if an address accessed at each time does not depend on input data. Many important tasks including matrix computation, signal processing, sorting, dynamic programming, and encryption/decryption can be […]
Robert Strzodka
Scientific data is mostly multi-valued, e.g., coordinates, velocities, moments or feature components, and it comes in large quantities. The data layout of such containers has an enormous impact on the achieved performance; however, layout optimization is very time-consuming and error-prone because container access syntax in standard programming languages is not sufficiently abstract. This means that […]

Free GPU computing nodes at hgpu.org

Registered users can now run their OpenCL applications at hgpu.org. We provide 1 minute of computer time per run on two nodes equipped with AMD and nVidia graphics processing units, as listed below. There are no restrictions on the number of runs.

The platforms are:

Node 1
  • GPU device 0: nVidia GeForce GTX 560 Ti 2GB, 822MHz
  • GPU device 1: AMD/ATI Radeon HD 6970 2GB, 880MHz
  • CPU: AMD Phenom II X6 @ 2.8GHz 1055T
  • RAM: 12GB
  • OS: OpenSUSE 13.1
  • SDK: nVidia CUDA Toolkit 6.5.14, AMD APP SDK 3.0
Node 2
  • GPU device 0: AMD/ATI Radeon HD 7970 3GB, 1000MHz
  • GPU device 1: AMD/ATI Radeon HD 5870 2GB, 850MHz
  • CPU: Intel Core i7-2600 @ 3.4GHz
  • RAM: 16GB
  • OS: OpenSUSE 12.3
  • SDK: AMD APP SDK 3.0

A completed OpenCL project should be uploaded via the User dashboard (see the instructions and example there); compilation and execution terminal output logs will be provided to the user.

The information sent to hgpu.org will be treated according to our Privacy Policy.

HGPU group © 2010-2015 hgpu.org

All rights belong to the respective authors
