Posts
Oct, 27
Compiling and Optimizing Java 8 Programs for GPU Execution
GPUs can enable significant performance improvements for certain classes of data parallel applications and are widely used in recent computer systems. However, GPU execution currently requires explicit low-level operations such as 1) managing memory allocations and transfers between the host system and the GPU, 2) writing GPU kernels in a low-level programming model such as […]
Oct, 27
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
High-level languages such as Java increase both productivity and portability with productive language features such as managed runtime, type safety, and precise exception semantics. Additionally, Java 8 provides parallel stream APIs with lambda expressions to facilitate parallel programming for mainstream users of multi-core CPUs and many-core GPUs. These high-level APIs avoid the complexity of writing […]
Oct, 27
Overlap fermions on GPUs
We report on our efforts to implement overlap fermions on NVIDIA GPUs using CUDA, commenting on the algorithms used, implemetation details, and the performance of our code.
Oct, 25
ZNN – A Fast and Scalable Algorithm for Training 3D Convolutional Networks on Multi-Core and Many-Core Shared Memory Machines
Convolutional networks (ConvNets) have become a popular approach to computer vision. It is important to accelerate ConvNet training, which is computationally costly. We propose a novel parallel algorithm based on decomposition into a set of tasks, most of which are convolutions or FFTs. Applying Brent’s theorem to the task dependency graph implies that linear speedup […]
Oct, 25
Execution of Compound Multi-Kernel OpenCL Computations in Multi-CPU/Multi-GPU Environments
Current computational systems are heterogeneous by nature, featuring a combination of CPUs and GPUs. As the latter are becoming an established platform for high-performance computing, the focus is shifting towards the seamless programming of these hybrid systems as a whole. The distinct nature of the architectural and execution models in place raises several challenges, as […]
Oct, 25
Multi-GPU Distributed Parallel Bayesian Differential Topic Modelling
There is an explosion of data, documents, and other content, and people require tools to analyze and interpret these, tools to turn the content into information and knowledge. Topic modeling have been developed to solve these problems. Topic models such as LDA [Blei et. al. 2003] allow salient patterns in data to be extracted automatically. […]
Oct, 25
Modern Gyrokinetic Particle-In-Cell Simulation of Fusion Plasmas on Top Supercomputers
The Gyrokinetic Toroidal Code at Princeton (GTC-P) is a highly scalable and portable particle-in-cell (PIC) code. It solves the 5D Vlasov-Poisson equation featuring efficient utilization of modern parallel computer architectures at the petascale and beyond. Motivated by the goal of developing a modern code capable of dealing with the physics challenge of increasing problem size […]
Oct, 25
Join Algorithms on GPUs: A Revisit After Seven Years
Implementing database operations on parallel platforms has gain a lot of momentum in the past decade. A number of studies have shown the potential of using GPUs to speed up database operations. In this paper, we present empirical evaluations of a state-of-the-art work published in SIGMOD’08 on GPU-based join processing. In particular, such work provides […]
Oct, 22
Sequential Code Parallelization for Multi-core Embedded Systems: A Survey of Models, Algorithms and Tools
In recent years the industry experienced a shift in the design and manufacture of processors. Multiple-core processors in one single chip started replacing the common used single-core processors. This design trend reached the develop of System-on-Chip, widely used in embedded systems, and turned them into powerful Multiprocessor System-on-Chip. These multi-core systems have presented not only […]
Oct, 22
A linguistic approach to concurrent, distributed, and adaptive programming across heterogeneous platforms
Two major trends in computing hardware during the last decade have been an increase in the number of processing cores found in individual computer hardware platforms and an ubiquity of distributed, heterogeneous systems. Together, these changes can improve not only the performance of a range of applications, but the types of applications that can be […]
Oct, 22
Stadium Hashing: Scalable and Flexible Hashing on GPUs
Hashing is one of the most fundamental operations that provides a means for a program to obtain fast access to large amounts of data. Despite the emergence of GPUs as many-threaded general purpose processors, high performance parallel data hashing solutions for GPUs are yet to receive adequate attention. Existing hashing solutions for GPUs not only […]
Oct, 22
BLASX: A High Performance Level-3 BLAS Library for Heterogeneous Multi-GPU Computing
Basic Linear Algebra Subprograms (BLAS) are a set of low level linear algebra kernels widely adopted by applications involved with the deep learning and scientific computing. The massive and economic computing power brought forth by the emerging GPU architectures drives interest in implementation of compute-intensive level 3 BLAS on multi-GPU systems. In this paper, we […]