Posts

Feb, 20

Complexity Analysis and Algorithm Design for Reorganizing Data to Minimize Non-Coalesced Memory Accesses on GPU

The performance of Graphics Processing Units (GPUs) is sensitive to irregular memory references. Some recent work shows the promise of data reorganization for eliminating non-coalesced memory accesses that are caused by irregular references. However, all previous studies have employed simple, heuristic methods to determine the new data layouts to create. As a result, they either […]
Feb, 20

An abstract object oriented runtime system for heterogeneous parallel architecture

In our paper we present an abstract object oriented runtime system that helps to develop scientific applications for new heterogeneous architectures based on multiple nodes of multi-core processors enhanced with accelerator boards. Its architecture, based on abstract concepts, makes it possible to follow hardware technology by extending them with new implementations modeling new hardware components, while limiting the […]
Feb, 20

Streaming Data from HDD to GPUs for Sustained Peak Performance

In the context of genome-wide association studies (GWAS), one has to solve long sequences of generalized least-squares problems; such a task has two limiting factors: execution time (often in the range of days or weeks) and data management (data sets in the order of Terabytes). We present an algorithm that obviates both issues. By […]
Feb, 20

ClusCo: clustering and comparison of protein models

BACKGROUND: The development, optimization and validation of protein modeling methods require efficient tools for structural comparison. Frequently, a large number of models need to be compared with the target native structure. The main reason for the development of the ClusCo software was to create a high-throughput tool for all-versus-all comparison, because calculating a similarity matrix is the […]
Feb, 20

Implementation and performance evaluation of a GPU particle-in-cell code

In this thesis, I designed and implemented a particle-in-cell (PIC) code on a graphics processing unit (GPU) using NVIDIA’s Compute Unified Device Architecture (CUDA). The massively parallel nature of computing on a GPU necessitated the development of new methods for various steps of the PIC method. I investigated different algorithms and data structures used in the […]
Feb, 18

Accelerating encryption using commodity hardware

Dedicated hardware encryption offers both low latency and high throughput at the expense of higher cost. A system that would encompass several architectures (SISD/SIMD) with a high number of memory hierarchies might be able to perform close to a dedicated encryption unit at a fraction of the cost. This report establishes the possibility of building […]
Feb, 18

Using High Performance Computing to Improve Image Guided Cancer Treatment

Radiotherapy is one of the main cancer treatments used today. It is a complex process that relies on locating the cancer in the patient's images as accurately as possible in order to minimize the radiation that the surrounding organs receive. Given that a typical radiotherapy treatment process lasts for 6 weeks, ideally, […]
Feb, 18

A Simplified and Accurate Model of Power-Performance Efficiency on Emergent GPU Architectures

Emergent heterogeneous systems must be optimized for both power and performance at exascale. Massive parallelism combined with complex memory hierarchies forms a barrier to efficient application and architecture design. These challenges are exacerbated with GPUs as parallelism increases by orders of magnitude and power consumption can easily double. Models have been proposed to isolate power and […]
Feb, 18

Offload Compiler Runtime for the Intel Xeon Phi Coprocessor

The Intel Xeon Phi coprocessor platform has a software stack that enables new programming models. One such model is offload of computation from a host processor to a coprocessor that is a fully-functional Intel Architecture CPU, namely, the Intel Xeon Phi coprocessor. The purpose of that offload is to improve response time and/or throughput. The […]
Feb, 18

Formalizing Address Spaces with application to Cuda, OpenCL, and beyond

Cuda and OpenCL are aimed at programmers developing parallel applications targeting GPUs and embedded micro-processors. These systems often have explicitly managed memories exposed directly through a notion of disjoint address spaces. OpenCL address spaces are based on a similar concept found in Embedded C. A limitation of OpenCL is that a specific pointer must be […]
Feb, 15

Hybrid parallel programming – evaluation of OpenACC

OpenACC is a new specification for a hybrid (CPU + GPU) parallel programming API, in which the programmer uses compiler directives to distribute the computation between the GPU and the CPU. With a similar paradigm to OpenMP, OpenACC presents clear advantages in terms of ease of programming. Regarding performance, however, a comparison between OpenACC and […]
Feb, 15

pROST: A Smoothed Lp-norm Robust Online Subspace Tracking Method for Realtime Background Subtraction in Video

An increasing number of methods for background subtraction use Robust PCA to identify sparse foreground objects. While many algorithms use the L1-norm as a convex relaxation of the ideal sparsifying function, we approach the problem with a smoothed Lp-norm and present pROST, a method for robust online subspace tracking. The algorithm is based on alternating […]


HGPU group © 2010-2025 hgpu.org

All rights belong to the respective authors
