Posts
Mar, 24
Accelerating ternary quantized convolutional neural networks using OpenCL for FPGA
FPGAs balance the reprogammability of CPUs and the performance of ASICs. They seem the perfect solution to increase the throughput of neural networks. However they must also prove to be highly competitive in a market dominated by GPUs. To achieve this, we focus on the strength of FPGAs that cannot be taken advantage of on […]
Mar, 24
Dissecting the NVidia Turing T4 GPU via Microbenchmarking
In 2019, the rapid rate at which GPU manufacturers refresh their designs, coupled with their reluctance to disclose microarchitectural details, is still a hurdle for those software designers who want to extract the highest possible performance. Last year, these very reasons motivated us to dissect the Volta GPU architecture using microbenchmarks. The introduction in August […]
Mar, 17
Novel Data-Partitioning Algorithms for Performance and Energy Optimization of Data-Parallel Applications on Modern Heterogeneous HPC Platforms
Heterogeneity has turned into one of the most profound and challenging characteristics of today’s HPC environments. Modern HPC platforms have become highly heterogeneous owing to the tight integration of multicore CPUs and accelerators (such as Graphics Processing Units, Intel Xeon Phis, or Field-Programmable Gate Arrays) empowering them to maximize the dominant objectives of performance and […]
Mar, 17
CLTestCheck: Measuring Test Effectiveness for GPU Kernels
Massive parallelism, and energy efficiency of GPUs, along with advances in their programmability with OpenCL and CUDA programming models have made them attractive for general-purpose computations across many application domains. Techniques for testing GPU kernels have emerged recently to aid the construction of correct GPU software. However, there exists no means of measuring quality and […]
Mar, 17
Performance Optimization of Memory Intensive Applications on FPGA Accelerator
Hardware accelerators are a fundamental part of modern high performance computing (HPC) systems due to their performance capabilities. The two most commonly used accelerators are GPUs and FPGAs. Despite the easier programmability and better memory performance of GPUs, generally FPGAs perform equally well for computationally challenging applications while dramatically reducing the energy consumption. Furthermore, with […]
Mar, 17
Analyzing GPU Tensor Core Potential for Fast Reductions
The Nvidia GPU architecture has introduced new computing elements such as the tensor cores, which are special processing units dedicated to perform fast matrix-multiply-accumulate (MMA) operations and accelerate Deep Learning applications. In this work we present the idea of using tensor cores for a different purpose such as the parallel arithmetic reduction problem, and propose […]
Mar, 17
TensorFlow Doing HPC
TensorFlow is a popular emerging open-source programming framework supporting the execution of distributed applications on heterogeneous hardware. While TensorFlow has been initially designed for developing Machine Learning (ML) applications, in fact TensorFlow aims at supporting the development of a much broader range of application kinds that are outside the ML domain and can possibly include […]
Mar, 10
Improving GPU Performance through Instruction Redistribution and Diversification
As throughput-oriented accelerators, GPUs provide tremendous processing power by executing a massive number of threads in parallel. However, exploiting high degrees of thread-level parallelism (TLP) does not always translate to the peak performance that GPUs can offer, leaving the GPUs resources often under-utilized. Compared to compute resources, memory resources can tolerate considerably lower levels of […]
Mar, 10
Energy Efficient Parallel K-Means Clustering for an Intel Hybrid Multi-Chip Package
FPGA devices have been proving to be good candidates to accelerate applications from different research topics. For instance, machine learning applications such as K-Means clustering usually relies on large amount of data to be processed, and, despite the performance offered by other architectures, FPGAs can offer better energy efficiency. With that in mind, Intel ® […]
Mar, 10
On the Portability of GPU-Accelerated Applications via Automated Source-to-Source Translation
Over the past decade, accelerator-based supercomputers have grown from 0% to 42% performance share on the TOP500. Ideally, GPUaccelerated code on such systems should be "write once, run anywhere," regardless of the GPU device (or for that matter, any parallel device, e.g., CPU or FPGA). In practice, however, portability can be significantly more limited due […]
Mar, 10
Custom Code Generation for a Graph DSL
Graph algorithms are at the heart of several applications, and achieving high performance with them has become critical due to the tremendous growth of irregular data. However, irregular algorithms are quite challenging to parallelize automatically, due to access patterns influenced by the input graph, which is unavailable until execution. Former research has addressed this issue […]
Mar, 10
GraphVite: A High-Performance CPU-GPU Hybrid System for Node Embedding
Learning continuous representations of nodes is attracting growing interest in both academia and industry recently, due to their simplicity and effectiveness in a variety of applications. Most of existing node embedding algorithms and systems are capable of processing networks with hundreds of thousands or a few millions of nodes. However, how to scale them to […]