Posts
Mar, 17
Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions
In the past decade, Convolutional Neural Networks (CNNs) have demonstrated state-of-the-art performance in various Artificial Intelligence tasks. To accelerate the experimentation and development of CNNs, several software frameworks have been released, primarily targeting power-hungry CPUs and GPUs. In this context, reconfigurable hardware in the form of FPGAs constitutes a potential alternative platform that can be […]
Mar, 17
HPVM: Heterogeneous Parallel Virtual Machine
We propose a parallel program representation for heterogeneous systems, designed to enable performance portability across a wide range of popular parallel hardware, including GPUs, vector instruction sets, multicore CPUs and potentially FPGAs. Our representation, which we call HPVM, is a hierarchical dataflow graph with shared memory and vector instructions. HPVM supports three important capabilities for […]
Mar, 17
CuLDA_CGS: Solving Large-scale LDA Problems on GPUs
Latent Dirichlet Allocation(LDA) is a popular topic model. Given the fact that the input corpus of LDA algorithms consists of millions to billions of tokens, the LDA training process is very time-consuming, which may prevent the usage of LDA in many scenarios, e.g., online service. GPUs have benefited modern machine learning algorithms and big data […]
Mar, 17
Improved OpenCL-based Implementation of Social Field Pedestrian Model
Two aspects of improvements are proposed for the OpenCL-based implementation of the social field pedestrian model. In the aspect of algorithm, a method based on the idea of divide-and-conquer is devised in order to overcome the problem of global memory depletion when fields are of a larger size. This is of importance for the study […]
Mar, 17
NVIDIA Tensor Core Programmability, Performance & Precision
The NVIDIA Volta GPU microarchitecture introduces a specialized unit, called "Tensor Core" that performs one matrix-multiply-and-accumulate on 4×4 matrices per clock cycle. The NVIDIA Tesla V100 accelerator, featuring the Volta microarchitecture, provides 640 Tensor Cores with a theoretical peak performance of 125 Tflops/s in mixed precision. In this paper, we investigate current approaches to program […]
Mar, 10
Portable Real-Time DCT Based Steganography Using OpenCL
In this paper a steganographic method for real time data hiding is proposed. The main goal of the research is to develop steganographic method with increased robustness to unintentional image processing attacks. In addition, we prove the validity of the method in real time applications. The method is based on a discrete cosine transform (DCT) […]
Mar, 10
OpenCL-ready High Speed FPGA Network for Reconfigurable High Performance Computing
Field programmable gate arrays (FPGAs) have gained attention in high-performance computing (HPC) research because their computation and communication capabilities have dramatically improved in recent years as a result of improvements to semiconductor integration technologies that depend on Moore’s Law. In addition to FPGA performance improvements, OpenCL-based FPGA development toolchains have been developed and offered by […]
Mar, 10
Multithreaded Dense Linear Algebra on Asymmetric Multi-core Processors
Nowadays, there exists a large variety of scientific, industry and engineering applications that require high computational power and storage, and their demands continue to grow; in order to obtain more precise solutions in these applications, scientists need to elaborate and work with more sophisticated and complex physical and mathematical models. Consequently, the capacity of new […]
Mar, 10
Programming Abstractions and Optimization Techniques for GPU-based Heterogeneous Systems
CPU/GPU heterogeneous systems have shown remarkable advantages in performance and energy consumption compared to homogeneous ones such as standard multi-core systems. Such heterogeneity represents one of the most promising trends for the near-future evolution of high performance computing hardware. However, as a double-edged sword, the heterogeneity also brings significant programming complexities that prevent the easy […]
Mar, 10
Chebyshev Filter Diagonalization on Modern Manycore Processors and GPGPUs
Chebyshev filter diagonalization is well established in quantum chemistry and quantum physics to compute bulks of eigenvalues of large sparse matrices. Choosing a block vector implementation, we investigate optimization opportunities on the new class of high-performance compute devices featuring both high-bandwidth and low-bandwidth memory. We focus on the transparent access to the full address space […]
Mar, 3
Technical Report about Tiramisu: a Three-Layered Abstraction for Hiding Hardware Complexity from DSL Compilers
High-performance DSL developers work hard to take advantage of modern hardware. The DSL compilers have to build their own complex middle-ends before they can target a common back-end such as LLVM, which only handles single instruction streams with SIMD instructions. We introduce Tiramisu, a common middle-end that can generate efficient code for modern processors and […]
Mar, 3
OpenCL Acceleration for TensorFlow
There is huge demand for targeting complex and large-scale machine learning applications particularly those based on popular actively-maintained frameworks such as TensorFlow and CAFFE to a variety of platforms with accelerators ranging from high-end desktop GPUs to resource-constrained embedded or mobile GPUs, FPGAs, and DSPs. However, to deliver good performance different platforms may require different […]