Posts
Apr, 7
Sparse Matrix-Matrix Multiplication on Multilevel Memory Architectures : Algorithms and Experiments
Architectures with multiple classes of memory media are becoming a common part of mainstream supercomputer deployments. So called multi-level memories offer differing characteristics for each memory component including variation in bandwidth, latency and capacity. This paper investigates the performance of sparse matrix multiplication kernels on two leading high-performance computing architectures — Intel’s Knights Landing processor […]
Mar, 31
HDArray: Parallel Array Interface for Distributed Heterogeneous Devices
Heterogeneous clusters with nodes containing one or more accelerators, such as GPUs, have become common. While MPI provides a mechanism and management of inter-address space communication, and OpenCL provides a way to manage computation and communication within a process with access to heterogeneous computational resources, programmers are forced to write hybrid programs that manage the […]
Mar, 31
A Comparison between GPU-based Volume Ray Casting Implementations: Fragment Shader, Compute Shader, OpenCL, and CUDA
Volume rendering is an important area of study in computer graphics, due to its application in areas such as medicine, physic simulations, oil and gas industries, and others. The main used method nowadays for volume rendering is ray casting. Nevertheless, there are a variety of parallel APIs that can be used to implement it. Thus, […]
Mar, 31
Face Recognition with Hybrid Efficient Convolution Algorithms on FPGAs
Deep Convolutional Neural Networks have become a Swiss knife in solving critical artificial intelligence tasks. However, deploying deep CNN models for latency-critical tasks remains to be challenging because of the complex nature of CNNs. Recently, FPGA has become a favorable device to accelerate deep CNNs thanks to its high parallel processing capability and energy efficiency. […]
Mar, 31
Python Non-Uniform Fast Fourier Transform (PyNUFFT): An Accelerated Non-Cartesian MRI Package on a Heterogeneous Platform (CPU/GPU)
A Python non-uniform fast Fourier transform (PyNUFFT) package has been developed to accelerate multidimensional non-Cartesian image reconstruction on heterogeneous platforms. Since scientific computing with Python encompasses a mature and integrated environment, the time efficiency of the NUFFT algorithm has been a major obstacle to real-time non-Cartesian image reconstruction with Python. The current PyNUFFT software enables […]
Mar, 31
Design Principles for Sparse Matrix Multiplication on the GPU
We implement two novel algorithms for sparse-matrix dense-matrix multiplication (SpMM) on the GPU. Our algorithms expect the sparse input in the popular compressed-sparse-row (CSR) format and thus do not require expensive format conversion. While previous SpMM work concentrates on thread-level parallelism, we additionally focus on latency hiding with instruction-level parallelism and load-balancing. We show, both […]
Mar, 25
Scalable Breadth-First Search on a GPU Cluster
On a GPU cluster, the ratio of high computing power to communication bandwidth makes scaling breadth-first search (BFS) on a scale-free graph extremely challenging. By separating high and low out-degree vertices, we present an implementation with scalable computation and a model for scalable communication for BFS and direction-optimized BFS. Our communication model uses global reduction […]
Mar, 25
Optimization of Hierarchical Matrix Computation on GPU
The demand for dense matrix computation in large scale and complex simulations is increasing; however, the memory capacity of current computer system is insufficient for such simulations. Hierarchical matrix method (H-matrices) is attracting attention as a computational method that can reduce the memory requirements of dense matrix computations. However, the computation of H-matrices is more […]
Mar, 25
A development of an accelerator board dedicated for multi-precision arithmetic operations and its application to Feynman loop integrals II
Evaluation of a wide variety of Feynman diagrams with multi-loop integrals and physical parameters and its comparison with high energy experiments are expected to investigate new physics beyond the Standard Model. We have been developing a direct computation method of multi-loop integrals of Feynman diagrams. One of features of our method is that we adopt […]
Mar, 25
MALBEC: a new CUDA-C ray-tracer in General Relativity
A new CUDA-C code for tracing orbits around non-charged black holes is presented. This code is named MALBEC, and take advantage of the graphic processing units and the CUDA platform in order to track the geodesic motion of null and timelike test particles in Schwarzschild and Kerr. Additionally, a new general set of equations that […]
Mar, 25
Accelerating CNN inference on FPGAs: A Survey
Convolutional Neural Networks (CNNs) are currently adopted to solve an ever greater number of problems, ranging from speech recognition to image classification and segmentation. The large amount of processing required by CNNs calls for dedicated and tailored hardware support methods. Moreover, CNN workloads have a streaming nature, well suited to reconfigurable hardware architectures such as […]
Mar, 22
The VOLNA-OP2 Tsunami Code (Version 1.0)
In this paper, we present the VOLNA-OP2 tsunami model and implementation; a finite volume non-linear shallow water equations (NSWE) solver built on the OP2 domain specific language for unstructured mesh computations. VOLNA-OP2 is unique among tsunami solvers in its support for several high performance computing platforms: CPUs, the Intel Xeon Phi, and GPUs. This is […]