
Posts

Apr, 20

Design and implementation of software-managed caches for multicores with local memory

Heterogeneous multicores, such as Cell BE processors and GPGPUs, typically do not have caches for their accelerator cores because coherence traffic, cache misses, and latencies from different types of memory accesses add overhead and adversely affect instruction scheduling. Instead, the accelerator cores have internal local memory in which to place their code and data. Programmers of such […]
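
As background on the general technique (not the specific design in this paper), below is a minimal sketch of a direct-mapped software-managed cache in C++; the line size, line count, and the memcpy standing in for a DMA transfer are illustrative assumptions.

    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    // Minimal direct-mapped software cache over a large "global" array.
    // Line size, line count, and the memcpy standing in for a DMA transfer
    // are illustrative assumptions, not the design described in the paper.
    constexpr std::size_t LINE_SIZE = 128;   // bytes per cache line
    constexpr std::size_t NUM_LINES = 64;    // lines kept in local memory

    struct SoftwareCache {
        std::size_t tags[NUM_LINES];              // global line index of each cached line
        bool        valid[NUM_LINES] = {};
        uint8_t     data[NUM_LINES][LINE_SIZE];   // local copy of the lines

        // Return a local pointer to the byte at `offset` in global memory,
        // fetching the enclosing line into local storage on a miss.
        uint8_t* get(const uint8_t* global_mem, std::size_t offset) {
            std::size_t line = offset / LINE_SIZE;   // which global line
            std::size_t slot = line % NUM_LINES;     // direct-mapped slot
            if (!valid[slot] || tags[slot] != line) {
                // Miss: a real accelerator would issue a DMA transfer here.
                std::memcpy(data[slot], global_mem + line * LINE_SIZE, LINE_SIZE);
                tags[slot] = line;
                valid[slot] = true;
            }
            return &data[slot][offset % LINE_SIZE];
        }
    };
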
Apr, 20

Compressing Floating-Point Number Stream for Numerical Applications

Clusters of commodity computers and general-purpose computers with accelerators such as GPGPUs are now common platforms for solving computationally intensive tasks like scientific simulations. Both technologies provide users with high performance at relatively low cost. However, the low bandwidth of the interconnect relative to the computing performance hinders efficient operation of both cluster and accelerator […]
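
To illustrate the general idea behind such compressors (not necessarily the scheme proposed in this paper), the sketch below XORs each double with its predecessor and stores only the non-zero low-order bytes of the residual, which is small when consecutive values in the stream are close.

    #include <cstdint>
    #include <cstring>
    #include <vector>

    // Toy predictive compressor for a stream of doubles: XOR each value with
    // the previous one and keep only the bytes of the residual that are needed.
    // This shows the flavor of delta/XOR schemes; the paper's format may differ.
    std::vector<uint8_t> compress(const std::vector<double>& values) {
        std::vector<uint8_t> out;
        uint64_t prev = 0;
        for (double v : values) {
            uint64_t bits;
            std::memcpy(&bits, &v, sizeof bits);
            uint64_t residual = bits ^ prev;   // high bytes are zero for close values
            prev = bits;
            int nbytes = 8;                    // strip leading zero bytes
            while (nbytes > 0 && ((residual >> (8 * (nbytes - 1))) & 0xFF) == 0)
                --nbytes;
            out.push_back(static_cast<uint8_t>(nbytes));   // 1-byte length header
            for (int i = 0; i < nbytes; ++i)               // payload bytes, low to high
                out.push_back(static_cast<uint8_t>(residual >> (8 * i)));
        }
        return out;
    }
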
Apr, 19

HPP-Controller: An intra-node controller designed for connecting heterogeneous CPUs

Heterogeneity is considered a solution for scaling supercomputers to the petascale. Many systems composed of general-purpose CPUs and special processing units such as Cells, GPGPUs, and FPGAs have been implemented. In these systems, the CPU needs to interact with the special processing units to process data together, and thus communications between these heterogeneous processing units become […]
Apr, 19

Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs?

To extend the exponential performance scaling of future chip multiprocessors, improving energy efficiency has become a first-class priority. Single-chip heterogeneous computing has the potential to achieve greater energy efficiency by combining traditional processors with unconventional cores (U-cores) such as custom logic, FPGAs, or GPGPUs. Although U-cores are effective at increasing performance, their benefits can also […]
Apr, 19

Parallel Approaches for SWAMP Sequence Alignment

This document is a summary and overview of several approaches to implementing the local sequence alignment algorithms known as SWAMP and SWAMP+ on commercially available hardware. Using a Smith-Waterman style of alignment, these parallel algorithms have several innovative extensions that take advantage of the ASC associative computing model while maintaining speed and accuracy, and producing a […]
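
For reference, the core local-alignment recurrence that Smith-Waterman-style methods such as SWAMP build on, written here with a linear gap penalty $g$ and substitution score $s(a_i, b_j)$ (the ASC-specific extensions of SWAMP and SWAMP+ are not captured by this):

$$H_{i,j} = \max\left(0,\; H_{i-1,j-1} + s(a_i, b_j),\; H_{i-1,j} - g,\; H_{i,j-1} - g\right), \qquad H_{i,0} = H_{0,j} = 0.$$
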
Apr, 19

A Hybrid Analytical DRAM Performance Model

As process technology scales, the number of transistors that can fit in a unit area increases exponentially. Processor throughput, memory storage, and memory throughput have all been increasing at an exponential pace. As such, DRAM has become an ever-tightening bottleneck for applications with irregular memory access patterns. Computer architects in industry sometimes use ad […]
Apr, 19

Extending the Scalability of Single Chip Stream Processors with On-chip Caches

As semiconductor scaling continues, more transistors can be put onto the same chip despite growing challenges in clock frequency scaling. Stream processor architectures can make effective use of these additional resources for appropriate applications. However, it is important that programmer effort be amortized across future generations of stream processor architectures. Current industry projections suggest a […]
Apr, 19

A Task-centric Memory Model for Scalable Accelerator Architectures

This paper presents a task-centric memory model for 1000-core compute accelerators. Visual computing applications are emerging as an important class of workloads that can exploit 1000-core processors. In these workloads, we observe data sharing and communication patterns that can be leveraged in the design of memory systems for future 1000-core processors. Based on these insights, […]
Apr, 19

Dynamic warp formation: Efficient MIMD control flow on SIMD graphics hardware

Recent advances in graphics processing units (GPUs) have resulted in massively parallel hardware that is easily programmable and widely available in today’s desktop and notebook computer systems. GPUs typically use single-instruction, multiple-data (SIMD) pipelines to achieve high performance with minimal overhead for control hardware. Scalar threads running the same computing kernel are grouped together into […]
Apr, 19

Dynamic detection of uniform and affine vectors in GPGPU computations

We present a hardware mechanism that dynamically detects uniform and affine vectors in SPMD architectures such as Graphics Processing Units, in order to minimize pressure on the register file and reduce power consumption with minimal architectural modifications. A preliminary experimental analysis conducted with the Barra simulator shows that this optimization can benefit up to 34% […]
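
As a plain-software illustration of the property being detected (the paper performs this detection in hardware on vector registers), a vector is uniform when every lane holds the same value and affine when lane i holds base + i*stride; the sketch below only classifies a vector of lane values accordingly.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Software illustration of the uniform/affine classification that the
    // proposed hardware applies to vector registers; the detection mechanism
    // itself is not shown here.
    enum class VectorKind { Uniform, Affine, Generic };

    VectorKind classify(const std::vector<int32_t>& lanes) {
        if (lanes.size() < 2) return VectorKind::Uniform;
        const int32_t stride = lanes[1] - lanes[0];
        for (std::size_t i = 2; i < lanes.size(); ++i)
            if (lanes[i] - lanes[i - 1] != stride)
                return VectorKind::Generic;          // neither uniform nor affine
        return stride == 0 ? VectorKind::Uniform     // all lanes equal
                           : VectorKind::Affine;     // lanes form base + i * stride
    }
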
Apr, 19

Parallel calculation of the median and order statistics on GPUs with application to robust regression

We present and compare various approaches to a classical selection problem on Graphics Processing Units (GPUs). The selection problem consists of selecting the $k$-th smallest element from an array of size $n$, called the $k$-th order statistic. We focus on calculating the median of a sample, the $n/2$-th order statistic. We introduce a new method based […]
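
As a CPU-side reference point for the problem statement (not one of the GPU methods compared in the paper), std::nth_element computes the $k$-th order statistic in expected linear time:

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Baseline definition of the k-th order statistic on the CPU:
    // std::nth_element partially sorts the array so that position k holds
    // the k-th smallest element (0-based), in expected linear time.
    double kth_order_statistic(std::vector<double> a, std::size_t k) {
        std::nth_element(a.begin(), a.begin() + static_cast<std::ptrdiff_t>(k), a.end());
        return a[k];
    }

    double median(std::vector<double> a) {
        const std::size_t k = a.size() / 2;   // the n/2-th order statistic
        return kth_order_statistic(std::move(a), k);
    }
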
Apr, 19

A pseudospectral matrix method for time-dependent tensor fields on a spherical shell

We construct a pseudospectral method for the solution of time-dependent, non-linear partial differential equations on a three-dimensional spherical shell. The problem we address is the treatment of tensor fields on the sphere. As a test case we consider the evolution of a single black hole in numerical general relativity. A natural strategy would be the […]
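
For orientation, the standard pseudospectral representation of a scalar field on a spherical shell expands it in spherical harmonics at each radial collocation point (the tensor-field treatment that is the paper's focus requires more than this scalar expansion):

$$f(r_k, \theta, \phi) = \sum_{\ell=0}^{L} \sum_{m=-\ell}^{\ell} \hat{f}_{\ell m}(r_k)\, Y_{\ell m}(\theta, \phi).$$
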
