
Posts

Mar, 17

Analyzing GPU Tensor Core Potential for Fast Reductions

The Nvidia GPU architecture has introduced new computing elements such as tensor cores: special processing units dedicated to performing fast matrix-multiply-accumulate (MMA) operations and accelerating Deep Learning applications. In this work we present the idea of using tensor cores for a different purpose, namely the parallel arithmetic reduction problem, and propose […]
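
As a rough illustration of the idea (a hand-written sketch, not the authors' code), the CUDA kernel below lets each warp fold 16x16 tiles of a half-precision array with one tensor-core MMA per tile: multiplying an all-ones matrix by a data tile puts the tile's column sums in every row of the product, so a single mma_sync reduces 256 elements. The kernel name and launch assumptions (length a multiple of 256, at most 256 threads per block, sm_70 or newer) are assumptions of the sketch.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void warp_mma_reduce(const half *data, float *partials, int num_tiles) {
    const int warps_per_block = blockDim.x / warpSize;
    const int warp = blockIdx.x * warps_per_block + threadIdx.x / warpSize;
    const int lane = threadIdx.x % warpSize;
    const int total_warps = gridDim.x * warps_per_block;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> ones;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> tile;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
    wmma::fill_fragment(ones, __float2half(1.0f));
    wmma::fill_fragment(acc, 0.0f);

    for (int t = warp; t < num_tiles; t += total_warps) {
        wmma::load_matrix_sync(tile, data + t * 256, 16);
        wmma::mma_sync(acc, ones, tile, acc);        // acc += ones * tile
    }

    __shared__ float smem[8][16 * 16];               // one 16x16 result per warp
    float *out = smem[threadIdx.x / warpSize];
    wmma::store_matrix_sync(out, acc, 16, wmma::mem_row_major);
    if (lane == 0) {                                 // row 0 holds the column sums
        float s = 0.0f;
        for (int j = 0; j < 16; ++j) s += out[j];
        partials[warp] = s;                          // host or a second pass folds these
    }
}
```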
Mar, 17

TensorFlow Doing HPC

TensorFlow is a popular emerging open-source programming framework supporting the execution of distributed applications on heterogeneous hardware. While TensorFlow was initially designed for developing Machine Learning (ML) applications, it in fact aims to support a much broader range of application kinds that lie outside the ML domain and can possibly include […]
Mar, 10

Improving GPU Performance through Instruction Redistribution and Diversification

As throughput-oriented accelerators, GPUs provide tremendous processing power by executing a massive number of threads in parallel. However, exploiting high degrees of thread-level parallelism (TLP) does not always translate into the peak performance that GPUs can offer, often leaving GPU resources under-utilized. Compared to compute resources, memory resources can tolerate considerably lower levels of […]
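
The general effect can be imitated by hand in a small CUDA example (a sketch of the underlying idea only; the paper itself works at the compiler level): giving each thread two independent accumulation chains diversifies its instruction stream, so the scheduler can overlap their latencies instead of relying on extra threads. All names here are illustrative.

```cuda
__global__ void sum_two_chains(const float *x, float *out, int n) {
    const int stride = gridDim.x * blockDim.x;
    float a = 0.0f, b = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += 2 * stride) {
        a += x[i];                                   // independent of the add below,
        if (i + stride < n) b += x[i + stride];      // so the two chains can overlap
    }
    atomicAdd(out, a + b);                           // fold the per-thread partials
}
```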
Mar, 10

Energy Efficient Parallel K-Means Clustering for an Intel Hybrid Multi-Chip Package

FPGA devices have proven to be good candidates for accelerating applications from a variety of research topics. For instance, machine learning applications such as K-Means clustering usually rely on large amounts of data to be processed, and, despite the performance offered by other architectures, FPGAs can offer better energy efficiency. With that in mind, Intel® […]
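
For reference, the computational core being accelerated is compact. Below is a minimal sketch of the K-Means assignment step, written in CUDA only to keep this page's examples in one language (the paper targets an FPGA through OpenCL); the names and row-major layout are assumptions.

```cuda
// Each thread labels one point with its nearest of k centroids.
__global__ void assign_clusters(const float *points, const float *centroids,
                                int *labels, int n, int k, int dim) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int best = 0;
    float best_d = 3.4e38f;                          // ~FLT_MAX
    for (int c = 0; c < k; ++c) {
        float d = 0.0f;
        for (int j = 0; j < dim; ++j) {              // squared Euclidean distance
            float diff = points[i * dim + j] - centroids[c * dim + j];
            d += diff * diff;
        }
        if (d < best_d) { best_d = d; best = c; }
    }
    labels[i] = best;                                // the update step then averages per label
}
```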
Mar, 10

On the Portability of GPU-Accelerated Applications via Automated Source-to-Source Translation

Over the past decade, accelerator-based supercomputers have grown from 0% to 42% of the performance share on the TOP500. Ideally, GPU-accelerated code on such systems should be "write once, run anywhere," regardless of the GPU device (or, for that matter, any parallel device, e.g., CPU or FPGA). In practice, however, portability can be significantly more limited due […]
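
The kind of mechanical rewrite such a translator performs can be seen on a toy kernel (hand-translated here; this is not the output of the paper's tool): CUDA's built-in indices and qualifiers map onto OpenCL's.

```cuda
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // -> get_global_id(0)
    if (i < n) y[i] = a * x[i] + y[i];
}
/* The corresponding OpenCL kernel:
__kernel void saxpy(int n, float a, __global const float *x, __global float *y) {
    int i = get_global_id(0);                        // one flat global index
    if (i < n) y[i] = a * x[i] + y[i];
}
   Likewise __shared__ maps to __local and __syncthreads() to
   barrier(CLK_LOCAL_MEM_FENCE). */
```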
Mar, 10

GraphVite: A High-Performance CPU-GPU Hybrid System for Node Embedding

Learning continuous representations of nodes has recently attracted growing interest in both academia and industry, due to their simplicity and effectiveness in a variety of applications. Most existing node embedding algorithms and systems can process networks with hundreds of thousands or a few millions of nodes. However, how to scale them to […]
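
As a rough sketch of the inner GPU step such systems run (GraphVite's real kernels are more elaborate; every name below is hypothetical), one block can apply a logistic, skip-gram-style update to a sampled edge (u, v), with label 1.0f for observed edges and 0.0f for negative samples. Launch with one block per sample and blockDim.x equal to the embedding dimension.

```cuda
__device__ float sigmoidf(float z) { return 1.0f / (1.0f + expf(-z)); }

__global__ void sgd_edge_update(float *emb, const int *us, const int *vs,
                                const float *labels, int num_samples,
                                int dim, float lr) {
    int s = blockIdx.x;                   // one sampled edge per block
    int j = threadIdx.x;                  // one embedding dimension per thread
    if (s >= num_samples) return;
    float *eu = emb + (long long)us[s] * dim;
    float *ev = emb + (long long)vs[s] * dim;

    __shared__ float dot;                 // naive dot product by thread 0;
    if (j == 0) {                         // a real kernel reduces in parallel
        float d = 0.0f;
        for (int t = 0; t < dim; ++t) d += eu[t] * ev[t];
        dot = d;
    }
    __syncthreads();

    float g = lr * (labels[s] - sigmoidf(dot));  // logistic-loss gradient scale
    float du = g * ev[j], dv = g * eu[j];        // read both before writing
    atomicAdd(&eu[j], du);                       // hogwild-style updates: other
    atomicAdd(&ev[j], dv);                       // blocks may touch the same rows
}
```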
Mar, 10

Custom Code Generation for a Graph DSL

Graph algorithms are at the heart of several applications, and achieving high performance with them has become critical due to the tremendous growth of irregular data. However, irregular algorithms are quite challenging to parallelize automatically, due to access patterns influenced by the input graph, which is unavailable until execution. Prior research has addressed this issue […]
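
The irregularity in question is easy to see in the shape of code a graph DSL compiler typically emits. The vertex-parallel CUDA kernel below (illustrative only, not the paper's generated output) walks a CSR graph, so each thread's loop bound and memory targets depend on its vertex's degree and neighbor list.

```cuda
__global__ void relax_vertices(const int *row_ptr, const int *col_idx,
                               const int *dist, int *new_dist, int num_vertices) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= num_vertices) return;
    int best = dist[v];
    for (int e = row_ptr[v]; e < row_ptr[v + 1]; ++e) {  // degree-dependent loop
        int u = col_idx[e];                              // data-dependent access
        best = min(best, dist[u] + 1);                   // unit-weight relaxation
    }
    new_dist[v] = best;
}
```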
Mar, 3

Using Compiler Directives for Performance Portability in Scientific Computing: Kernels from Molecular Simulation

Achieving performance portability for high-performance computing (HPC) applications in scientific fields has become an increasingly important goal due to large differences among emerging supercomputer architectures. Here we test some key kernels from molecular dynamics (MD) to determine whether applying the OpenACC directive-based programming model to these kernels can result in performance […]
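
To show the flavor of the approach on a deliberately simplified kernel (a toy one-dimensional Lennard-Jones-style loop, not one of the paper's MD kernels): a pair-interaction loop annotated with OpenACC directives, which the compiler can target to a multicore CPU or a GPU without source changes, e.g. with nvc++ -acc=gpu.

```cpp
void pair_forces(int n, const float *px, float *fx) {
    #pragma acc parallel loop copyin(px[0:n]) copyout(fx[0:n])
    for (int i = 0; i < n; ++i) {
        float f = 0.0f;
        #pragma acc loop reduction(+:f)
        for (int j = 0; j < n; ++j) {
            if (j == i) continue;
            float r = px[j] - px[i];
            float inv2 = 1.0f / (r * r);
            float inv6 = inv2 * inv2 * inv2;
            f += 24.0f * inv6 * (2.0f * inv6 - 1.0f) * inv2 * r;  // LJ-style term
        }
        fx[i] = f;
    }
}
```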
Mar, 3

Cooperative CPU, GPU, and FPGA heterogeneous execution with EngineCL

Heterogeneous systems are the core architecture of most High Performance Computing nodes, due to their excellent performance and energy efficiency. However, a key challenge remains: programmability, specifically releasing the programmer from the burden of managing data and devices with different architectures. To this end, we extend EngineCL to support FPGA devices. […]
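
The scheduling decision at the heart of co-execution can be sketched independently of EngineCL's API (which is not reproduced here): split one kernel's index space across devices in proportion to their measured throughputs. For example, proportional_split(n, {1, 6, 3}) hands a CPU 10%, a GPU 60% and an FPGA 30% of the n iterations.

```cpp
#include <cstddef>
#include <vector>

struct Split { std::size_t begin, end; };   // half-open range [begin, end)

std::vector<Split> proportional_split(std::size_t n,
                                      const std::vector<double> &throughput) {
    double total = 0.0;
    for (double t : throughput) total += t;
    std::vector<Split> splits;
    std::size_t begin = 0;
    for (std::size_t d = 0; d < throughput.size(); ++d) {
        std::size_t len = (d + 1 == throughput.size())
            ? n - begin                                   // last device: remainder
            : static_cast<std::size_t>(n * throughput[d] / total);
        splits.push_back({begin, begin + len});
        begin += len;
    }
    return splits;
}
```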
Mar, 3

Application level energy measurements and models for hybrid platform with accelerators

High Performance Computing is essential to continued advancement in many scientific and engineering fields. In recent years, due to the scale of the platforms and the breakdown of the scaling laws that had long supported rapid expansion, energy efficiency has emerged as a new design constraint on HPC platforms and applications. This constraint has increased the […]
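
On Nvidia GPUs, one common way to take the application-level power samples such models are built on is NVML; the minimal C program below reads the current board power draw (a generic illustration, not the paper's measurement setup). Build with gcc power.c -lnvidia-ml.

```c
#include <nvml.h>
#include <stdio.h>

int main(void) {
    nvmlDevice_t dev;
    unsigned int milliwatts = 0;
    if (nvmlInit() != NVML_SUCCESS) return 1;
    nvmlDeviceGetHandleByIndex(0, &dev);          // first GPU in the system
    nvmlDeviceGetPowerUsage(dev, &milliwatts);    // instantaneous board power
    printf("GPU power draw: %.1f W\n", milliwatts / 1000.0);
    nvmlShutdown();
    return 0;
}
```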
Mar, 3

Stateful Dataflow Multigraphs: A Data-Centric Model for High-Performance Parallel Programs

With the ubiquity of accelerators, such as FPGAs and GPUs, the complexity of high-performance programming is increasing beyond the skill-set of the average scientist in domains outside of computer science. It is thus imperative to decouple programming paradigms and architecture-specific implementation from the underlying scientific computations. We present the Stateful DataFlow multiGraph (SDFG), a data-centric […]
Mar, 3

cuSten – CUDA Finite Difference and Stencil Library

In this paper we present cuSten, a new library of functions to handle the implementation of 2D finite-difference/stencil programs in CUDA. cuSten wraps data handling, kernel calls and streaming into four easy-to-use functions that speed up development of numerical codes on GPU platforms. The paper also presents an example of this library applied […]
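
To see what such a library wraps, here is the kind of plain CUDA kernel it abstracts away (an illustration, not cuSten's API): a 5-point finite-difference Laplacian over the interior of an nx-by-ny grid with spacing h, one thread per grid point.

```cuda
__global__ void laplacian5(const float *in, float *out, int nx, int ny, float h) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < 1 || j < 1 || i >= nx - 1 || j >= ny - 1) return;  // skip the boundary
    int c = j * nx + i;
    out[c] = (in[c - 1] + in[c + 1] + in[c - nx] + in[c + nx]
              - 4.0f * in[c]) / (h * h);
}
```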

* * *

HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors
