Posts
Apr, 20
Accelerating Linpack Performance with Mixed Precision Algorithm on CPU+GPGPU Heterogeneous Cluster
In this paper, we introduce a mixed precision algorithm for solving linear systems of equations and the implementation of the HPL package. We use this mixed precision algorithm to improve the HPL package on CPU + GPGPU heterogeneous clusters, naming the result GHPL, and describe the implementation mechanisms in detail. The experimental results are measured […]
Apr, 20
Discrete-event Execution Alternatives on General Purpose Graphical Processing Units (GPGPUs)
Graphics cards, traditionally designed as accelerators for computer graphics, have evolved to support more general-purpose computation. General Purpose Graphical Processing Units (GPGPUs) are now being used as highly efficient, cost-effective platforms for executing certain simulation applications. While most of these applications belong to the category of timestepped simulations, little is known about the applicability of […]
Apr, 20
An efficient GPU implementation of the revised simplex method
The computational power provided by the massive parallelism of modern graphics processing units (GPUs) has moved increasingly into focus over the past few years. In particular, general purpose computing on GPUs (GPGPU) is attracting attention among researchers and practitioners alike. Yet GPGPU research is still in its infancy, and a major challenge is to rearrange […]
Apr, 20
Tutorial 3: Methodologies and Performance Impacts of General Purpose Computing on GPUs
Graphics Processing Units (GPUs) have been applied to graphics applications to render realistic perspectives of virtual scenes, especially in the entertainment market. Driven by market demand for super-high-definition scenes at high frame rates that simulate physical phenomena naturally in visualization applications, GPU performance has improved drastically over the last decade. […]
Apr, 20
Design and implementation of software-managed caches for multicores with local memory
Heterogeneous multicores, such as Cell BE processors and GPGPUs, typically do not have caches for their accelerator cores because coherence traffic, cache misses, and latencies from different types of memory accesses add overhead and adversely affect instruction scheduling. Instead, the accelerator cores have internal local memory to place their code and data. Programmers of such […]
Apr, 20
Compressing Floating-Point Number Stream for Numerical Applications
Clusters of commodity computers and general-purpose computers with accelerators such as GPGPUs are now common platforms for solving computationally intensive tasks like scientific simulations. Both technologies provide users with high performance at relatively low cost. However, the low bandwidth of the interconnect relative to the computing performance hinders efficient operation of both cluster and accelerator […]
Apr, 19
HPP-Controller: An intra-node controller designed for connecting heterogeneous CPUs
Heterogeneity is considered a solution for scaling supercomputers to petascale. Many systems composed of general CPUs and special processing units such as Cells, GPGPUs and FPGAs have been implemented. In these systems, CPUs need to interact with special processing units to process data together, thus communications between these heterogeneous processing units become […]
Apr, 19
Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs?
To extend the exponential performance scaling of future chip multiprocessors, improving energy efficiency has become a first-class priority. Single-chip heterogeneous computing has the potential to achieve greater energy efficiency by combining traditional processors with unconventional cores (U-cores) such as custom logic, FPGAs, or GPGPUs. Although U-cores are effective at increasing performance, their benefits can also […]
Apr, 19
Parallel Approaches for SWAMP Sequence Alignment
This document is a summary and overview of several approaches to implement the local sequence alignment algorithms known as SWAMP and SWAMP+ on commercially available hardware. Using a Smith-Waterman style of alignment, these parallel algorithms have several innovative extensions that take advantage of the ASC associative computing model while maintaining speed, accuracy, and producing a […]
Apr, 19
A Hybrid Analytical DRAM Performance Model
As process technology scales, the number of transistors that can fit in a unit area has increased exponentially. Processor throughput, memory storage, and memory throughput have all been increasing at an exponential pace. As such, DRAM has become an ever-tightening bottleneck for applications with irregular memory access patterns. Computer architects in industry sometimes use ad […]
Apr, 19
Extending the Scalability of Single Chip Stream Processors with On-chip Caches
As semiconductor scaling continues, more transistors can be put onto the same chip despite growing challenges in clock frequency scaling. Stream processor architectures can make effective use of these additional resources for appropriate applications. However, it is important that programmer effort be amortized across future generations of stream processor architectures. Current industry projections suggest a […]
Apr, 19
A Task-centric Memory Model for Scalable Accelerator Architectures
This paper presents a task-centric memory model for 1000-core compute accelerators. Visual computing applications are emerging as an important class of workloads that can exploit 1000-core processors. In these workloads, we observe data sharing and communication patterns that can be leveraged in the design of memory systems for future 1000-core processors. Based on these insights, […]