Posts
Oct, 30
Automatic CUDA Code Synthesis Framework for Multicore CPU and GPU architectures
Recently, general-purpose GPU (GPGPU) programming has spread rapidly, following the introduction of CUDA, which made it possible to write parallel programs for NVIDIA GPUs in a high-level language. While a GPU exploits data parallelism very effectively, task-level parallelism is exploited as multi-threaded programs on a multicore CPU. For such a heterogeneous platform that consists of a multicore CPU […]
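A minimal sketch of this division of labor, assuming nothing about the paper's actual framework: a data-parallel CUDA kernel runs on the GPU while an independent task runs concurrently on a CPU thread. The kernel and task names are hypothetical.

```cuda
#include <cstdio>
#include <thread>
#include <cuda_runtime.h>

// Data-parallel work: one GPU thread per element.
__global__ void scaleKernel(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Task-level work that stays on a CPU core.
void cpuTask(const char *name) {
    std::printf("CPU task %s running on its own thread\n", name);
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // Launch the data-parallel part on the GPU (asynchronous)...
    scaleKernel<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);

    // ...while an independent task runs concurrently on the CPU.
    std::thread t(cpuTask, "A");
    t.join();

    cudaDeviceSynchronize();   // wait for the GPU side to finish
    cudaFree(d_data);
    return 0;
}
```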
Oct, 30
Accelerating Real-time processing of the ATST Adaptive Optics System using Coarse-grained Parallel Hardware Architectures
Real-time processing for the four-meter Advanced Technology Solar Telescope (ATST) adaptive optics (AO) system, with approximately 1750 sub-apertures and 1900 actuators, requires massively parallel processing to complete the task. This parallelism is harnessed by adding hardware accelerators such as Field-Programmable Gate Arrays (FPGAs) and Graphics Processing Units (GPUs). We […]
Oct, 30
A scalable hybrid algorithm based on domain decomposition and algebraic multigrid for solving partial differential equations on a cluster of CPU/GPUs
Several of the top-ranked supercomputers are based on a hybrid architecture consisting of a large number of CPUs and GPUs. Very high performance has been obtained for problems with special structure, such as FFT-based image processing or N-body particle calculations. However, for the class of problems described by partial differential equations discretized by […]
Oct, 30
Efficient Implementation of the eta_T Pairing on GPU
Recently, efficient implementation of cryptographic algorithms on graphics processing units (GPUs) has attracted a lot of attention in the cryptologic research community. In this paper, we deal with efficient implementation of the $\eta_T$ pairing on supersingular curves over finite fields of characteristic 3. We report the performance results of implementations on NVIDIA GTX 285, GTX […]
Oct, 30
Optimizing and Auto-tuning Belief Propagation on the GPU
A CUDA kernel will utilize high-latency local memory for storage when there are not enough registers to hold the required data or if the data is an array that is accessed using a variable index within a loop. However, accesses from local memory take longer than accesses from registers and shared memory, so it is […]
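A minimal illustration of the spilling behavior described above (kernel names are illustrative): registers are not indexable, so a per-thread array read with a runtime-variable subscript is typically placed in local memory, while an array whose subscripts are all compile-time constants after unrolling can live entirely in registers.

```cuda
#include <cuda_runtime.h>

// A per-thread array indexed with a runtime-variable subscript is
// typically spilled to high-latency local memory, because registers
// are not indexable.
__global__ void variableIndex(const int *idx, float *out) {
    float buf[8];                       // candidate for local memory
    for (int k = 0; k < 8; ++k)
        buf[k] = k * 1.0f;
    int i = idx[threadIdx.x];           // index unknown at compile time
    out[threadIdx.x] = buf[i];          // forces an indexable (local) array
}                                       // (assumes 0 <= i < 8)

// If every subscript is a compile-time constant after full unrolling,
// the compiler can keep the array entirely in registers.
__global__ void constantIndex(float *out) {
    float buf[8];
    #pragma unroll
    for (int k = 0; k < 8; ++k)
        buf[k] = k * 1.0f;
    out[threadIdx.x] = buf[0] + buf[7]; // constant subscripts -> registers
}
```

Compiling with `nvcc -Xptxas -v` reports per-kernel spill stores and loads, which is how such spills are typically confirmed.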
Oct, 30
Effective Parallelization of Non-bonded Interactions Kernel for Virtual Screening on GPUs
In this work we discuss the benefits of using massively parallel architectures for the optimization of Virtual Screening methods. We empirically demonstrate that the GPU is a well-suited architecture for accelerating non-bonded interaction kernels, obtaining a sustained speedup of up to 260x over the sequential counterpart.
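The excerpt does not show the kernel itself; the following is a hedged sketch of a generic non-bonded kernel (a 12-6 Lennard-Jones term plus a Coulomb term, one thread per probe atom, unit parameters), not the authors' implementation.

```cuda
#include <cuda_runtime.h>

// Hypothetical non-bonded interaction kernel: one thread per probe
// atom, looping over all receptor atoms. Names and parameters are
// illustrative only.
__global__ void nonBonded(const float4 *probe,     // x, y, z + charge in w
                          const float4 *receptor,
                          float *energy,
                          int nProbe, int nReceptor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nProbe) return;

    float4 p = probe[i];
    float e = 0.0f;
    for (int j = 0; j < nReceptor; ++j) {
        float4 r = receptor[j];
        float dx = p.x - r.x, dy = p.y - r.y, dz = p.z - r.z;
        float r2 = dx*dx + dy*dy + dz*dz + 1e-6f;  // avoid divide-by-zero
        float inv6 = 1.0f / (r2 * r2 * r2);        // r^-6
        // 12-6 Lennard-Jones term plus a Coulomb term (unit parameters).
        e += inv6 * inv6 - inv6 + p.w * r.w * rsqrtf(r2);
    }
    energy[i] = e;
}
```

Every thread reads the same receptor stream independently, which is the data-parallel structure that makes this class of kernel map so well onto the GPU.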
Oct, 30
Intermediate Language Extensions for Parallelism
An Intermediate Language (IL) specifies a program at a level of abstraction that includes precise semantics for state updates and control flow, but leaves unspecified the low-level software and hardware mechanisms that will be used to implement the semantics. Past ILs have followed the von Neumann execution model by making sequential execution the default, and […]
Oct, 30
High-Level Synthesis for FPGAs: From Prototyping to Deployment
Escalating system-on-chip design complexity is pushing the design community to raise the level of abstraction beyond register transfer level. Despite the unsuccessful adoption of early generations of commercial high-level synthesis (HLS) systems, we believe that the tipping point for transitioning to HLS methodology is happening now, especially for field-programmable gate array (FPGA) designs. […]
Oct, 30
Exploring Many-Core Design Templates for FPGAs and ASICs
We present a highly productive approach to hardware design based on a many-core microarchitectural template used to implement compute-bound applications expressed in a high-level data-parallel language such as OpenCL. The template is customized on a per-application basis via a range of high-level parameters such as the interconnect topology or processing element architecture. The key benefits of […]
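For concreteness, this is the shape of data-parallel kernel such a template consumes, written here as the CUDA analogue of an OpenCL kernel: each thread (work-item) produces one output element with no cross-thread ordering the hardware must preserve, which is what lets the template scale across processing elements.

```cuda
// CUDA analogue of a minimal data-parallel OpenCL kernel (SAXPY):
// one output element per thread, no inter-thread dependencies.
__global__ void saxpy(float a, const float *x, const float *y,
                      float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global work-item id
    if (i < n)
        out[i] = a * x[i] + y[i];
}
```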
Oct, 30
Improving Energy Efficiency of GPU based General-Purpose Scientific Computing through Automated Selection of Near Optimal Configurations
Modern GPUs have been rapidly and increasingly used as a powerful engine for a variety of general-purpose computing applications due to their enormous parallelism and throughput capabilities. However, GPU power consumption remains high, since ever more transistors are integrated into the chip. Until now, how to increase and optimize energy efficiency (e.g., performance-per-Watt […]
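A minimal sketch of one axis of such a configuration search, assuming a simple kernel and using CUDA event timing; a real selector in the spirit of this paper would additionally fold measured power into a performance-per-Watt ranking. The kernel and sweep are illustrative, not the authors' method.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void work(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 1.0001f + 0.5f;
}

int main() {
    const int n = 1 << 22;
    float *d;
    cudaMalloc((void **)&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Sweep one configuration axis (threads per block); a full tuner
    // would also rank configurations by measured performance-per-Watt.
    for (int tpb = 64; tpb <= 1024; tpb *= 2) {
        cudaEventRecord(start);
        work<<<(n + tpb - 1) / tpb, tpb>>>(d, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        std::printf("%4d threads/block: %.3f ms\n", tpb, ms);
    }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}
```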
Oct, 30
Leveraging Binary Translation for Heterogeneous Profiling
Heterogeneous systems, such as those including a graphics processor for general computation, are becoming increasingly common. While this increases the potential computing power that can be leveraged, it also increases the complexity of the system. This in turn complicates the task of understanding the system's behavior, which is important when developing new software as […]
Oct, 30
GPU Computations in Heterogeneous Grid Environments
This thesis describes how the performance of job management systems on heterogeneous computing grids can be increased with Graphics Processing Units (GPUs). The focus lies on describing what is required to extend the grid to support the Open Computing Language (OpenCL) and how an OpenCL application can be implemented for the heterogeneous grid. Additionally, already […]
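A sketch of the device-discovery step a grid job manager needs before it can schedule GPU work, shown with the CUDA runtime for consistency with the sketches above; the OpenCL path the thesis targets would use clGetPlatformIDs/clGetDeviceIDs instead.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Enumerate GPUs on a grid node so the job manager can decide whether
// to schedule accelerated work here or fall back to CPU resources.
int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::printf("no GPU available: schedule on CPU resources\n");
        return 0;
    }
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        std::printf("device %d: %s, %d SMs\n",
                    i, prop.name, prop.multiProcessorCount);
    }
    return 0;
}
```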