
Posts

Feb, 16

Finding, Measuring, and Reducing Inefficiencies in Contemporary Computer Systems

Computer systems have become increasingly diverse and specialized in recent years. This complexity supports a wide range of new computing uses and users, but is not without cost: it has become difficult to maintain the efficiency of contemporary general purpose computing systems. Computing inefficiencies, which include nonoptimal runtimes, excessive energy use, and limits to scalability, […]
Feb, 16

SABER: Window-Based Hybrid Stream Processing for Heterogeneous Architectures

Modern servers have become heterogeneous, often combining multicore CPUs with many-core GPGPUs. Such heterogeneous architectures have the potential to improve the performance of data-intensive stream processing applications, but they are not supported by current relational stream processing engines. For an engine to exploit a heterogeneous architecture, it must execute streaming SQL queries with sufficient data-parallelism […]
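As a hedged illustration of the data-parallel window evaluation the abstract alludes to (not SABER's actual engine), the CUDA sketch below reduces one tumbling window of stream tuples per thread block; the kernel name windowSum and the fixed window layout are assumptions made for this example.

    #include <cuda_runtime.h>

    // Hypothetical sketch: a tumbling-window SUM over a batch of stream values.
    // One thread block reduces one window. This illustrates the data-parallel
    // evaluation style only; it is not SABER's implementation.
    __global__ void windowSum(const float* values, float* windowResults,
                              int windowSize)
    {
        extern __shared__ float partial[];
        int w = blockIdx.x;                 // window index
        int base = w * windowSize;

        // Each thread accumulates a strided slice of its window.
        float acc = 0.0f;
        for (int i = threadIdx.x; i < windowSize; i += blockDim.x)
            acc += values[base + i];
        partial[threadIdx.x] = acc;
        __syncthreads();

        // Standard shared-memory tree reduction (blockDim.x is a power of two).
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s)
                partial[threadIdx.x] += partial[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            windowResults[w] = partial[0];
    }

    int main()
    {
        const int windowSize = 1024, numWindows = 256;
        float *d_vals, *d_res;
        cudaMalloc(&d_vals, windowSize * numWindows * sizeof(float));
        cudaMalloc(&d_res, numWindows * sizeof(float));
        // ... fill d_vals with a batch of tuples from the stream ...
        windowSum<<<numWindows, 256, 256 * sizeof(float)>>>(d_vals, d_res, windowSize);
        cudaDeviceSynchronize();
        cudaFree(d_vals); cudaFree(d_res);
        return 0;
    }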
Feb, 16

CaffeLink: Mathematica binding for Caffe Deep Learning Framework

In this paper we present CaffeLink, an open-source library for Mathematica that provides a binding to the well-established Caffe deep learning framework. Caffe is a highly optimized, CUDA-accelerated library focused on convolutional neural networks, written in C++ with Python and Matlab bindings. CaffeLink is based upon Mathematica’s LibraryLink. It makes accessible most features of […]
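As a rough sketch of the LibraryLink plumbing such a binding builds on (this is not CaffeLink's code, and the exported function demo_plus_one is a hypothetical placeholder), a minimal LibraryLink entry point looks like this:

    #include "WolframLibrary.h"

    /* Boilerplate every LibraryLink library provides. */
    DLLEXPORT mint WolframLibrary_getVersion() { return WolframLibraryVersion; }
    DLLEXPORT int WolframLibrary_initialize(WolframLibraryData libData) { return LIBRARY_NO_ERROR; }
    DLLEXPORT void WolframLibrary_uninitialize(WolframLibraryData libData) { }

    /* Hypothetical exported function: adds one to a machine integer. Loaded
       from Mathematica with:
       LibraryFunctionLoad["demo", "demo_plus_one", {Integer}, Integer]    */
    DLLEXPORT int demo_plus_one(WolframLibraryData libData, mint Argc,
                                MArgument *Args, MArgument Res)
    {
        mint x = MArgument_getInteger(Args[0]);
        MArgument_setInteger(Res, x + 1);
        return LIBRARY_NO_ERROR;
    }

A real binding like CaffeLink would marshal tensors rather than integers, but the entry-point shape is the same.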
Feb, 11

Writing a performance-portable matrix multiplication

There are several frameworks that, while providing functional portability of code across different platforms, do not automatically provide performance portability. As a consequence, programmers have to hand-tune the kernel codes for each device. The Heterogeneous Programming Library (HPL) is one of these libraries, but it has the interesting feature that the kernel codes, which implement […]
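HPL kernels are written in the library's own embedded language; the hedged CUDA analogue below shows the kind of device-dependent parameter, here a tile size staged through shared memory, that per-device hand-tuning has to choose. It assumes square matrices whose dimension N is a multiple of TILE.

    // Hedged CUDA analogue of a tunable matrix-multiplication kernel; TILE is
    // the sort of parameter a tuner compares across devices. Not HPL code.
    template <int TILE>
    __global__ void matmulTiled(const float* A, const float* B, float* C, int N)
    {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;

        for (int t = 0; t < N; t += TILE) {
            // Stage one tile of A and one tile of B in on-chip shared memory.
            As[threadIdx.y][threadIdx.x] = A[row * N + t + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * N + col];
            __syncthreads();
            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }
        C[row * N + col] = acc;   // assumes N is a multiple of TILE
    }
    // Instantiations a tuner might compare: matmulTiled<8>, <16>, <32>.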
Feb, 10

Programming GPUs with C++14 and Just-In-Time Compilation

Systems that comprise accelerators (e.g., GPUs) promise high performance, but their programming is still a challenge, mainly for two reasons: 1) two distinct programming models have to be used within an application: one for the host CPU (e.g., C++), and one for the accelerator (e.g., OpenCL or CUDA); 2) using Just-In-Time (JIT) compilation and […]
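One concrete, existing mechanism for the JIT side of this is NVRTC, CUDA's runtime compilation library. The sketch below compiles a kernel from a C++ string at run time and loads the resulting PTX with the driver API (error checking elided); it is a generic NVRTC example, not the system proposed in the paper.

    #include <nvrtc.h>
    #include <cuda.h>
    #include <vector>

    // Kernel source held as an ordinary string until run time.
    const char* kernelSrc =
        "extern \"C\" __global__ void scale(float* x, float s, int n) {\n"
        "    int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
        "    if (i < n) x[i] *= s;\n"
        "}\n";

    int main()
    {
        // JIT-compile the string to PTX with NVRTC.
        nvrtcProgram prog;
        nvrtcCreateProgram(&prog, kernelSrc, "scale.cu", 0, nullptr, nullptr);
        const char* opts[] = { "--gpu-architecture=compute_50" };
        nvrtcCompileProgram(prog, 1, opts);

        size_t ptxSize;
        nvrtcGetPTXSize(prog, &ptxSize);
        std::vector<char> ptx(ptxSize);
        nvrtcGetPTX(prog, ptx.data());
        nvrtcDestroyProgram(&prog);

        // Load the freshly generated PTX with the driver API.
        cuInit(0);
        CUdevice dev;   cuDeviceGet(&dev, 0);
        CUcontext ctx;  cuCtxCreate(&ctx, 0, dev);
        CUmodule mod;   cuModuleLoadData(&mod, ptx.data());
        CUfunction fn;  cuModuleGetFunction(&fn, mod, "scale");
        // ... allocate arguments with cuMemAlloc and launch via cuLaunchKernel ...
        cuModuleUnload(mod);
        cuCtxDestroy(ctx);
        return 0;
    }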
Feb, 10

BinaryNet: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1

We introduce BinaryNet, a method which trains DNNs with binary weights and activations that are used when computing the parameters’ gradients. We show that it is possible to train a Multi-Layer Perceptron (MLP) on MNIST and ConvNets on CIFAR-10 and SVHN with BinaryNet and achieve nearly state-of-the-art results. At run-time, BinaryNet drastically reduces memory usage and replaces most […]
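The forward-pass quantization at the core of the method is just a sign function mapped to +1/-1; a minimal CUDA sketch is below. Note that BinaryNet keeps real-valued weights for the parameter update and propagates gradients through the sign with a straight-through estimator, which this kernel does not show.

    #include <cuda_runtime.h>

    // Deterministic binarization: sign(x) mapped to +1 / -1. Forward-pass
    // quantization only; the real-valued master weights used for updates
    // and the straight-through gradient are not shown here.
    __global__ void binarize(const float* realWeights, float* binWeights, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            binWeights[i] = (realWeights[i] >= 0.0f) ? 1.0f : -1.0f;
    }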
Feb, 10

GPU-Accelerated High-Level Synthesis for Bitwidth Optimization of FPGA Datapaths

Bitwidth optimization of FPGA datapaths can save hardware resources by choosing the minimum number of bits required for each datapath variable to achieve a desired quality of result. However, it is an NP-hard problem that requires unacceptably long runtimes when using sequential CPU-based heuristics. We show how to parallelize the key steps of bitwidth optimization […]
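The per-variable arithmetic underneath is small: assuming a proven value range [lo, hi], the minimum width is ceil(log2(hi - lo + 1)) bits. A minimal sketch follows; the helper name bitsForRange is invented for illustration, and the paper's actual contribution, parallelizing the analysis across the whole datapath on a GPU, is not attempted here.

    #include <cstdint>

    // Fewest bits needed to represent hi - lo + 1 distinct values,
    // i.e. ceil(log2(span + 1)) with a floor of one bit.
    int bitsForRange(int64_t lo, int64_t hi)
    {
        uint64_t span = static_cast<uint64_t>(hi - lo);
        int bits = 0;
        while (span > 0) { ++bits; span >>= 1; }
        return bits == 0 ? 1 : bits;
    }
    // e.g. bitsForRange(0, 255) == 8; bitsForRange(-4, 3) == 3.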
Feb, 10

FARGO3D: A new GPU-oriented MHD code

We present the recently publicly released FARGO3D code. It is a magnetohydrodynamics code developed with special emphasis on the physics of protoplanetary disks and planet-disk interactions, and parallelized with MPI. The hydrodynamics algorithms are based on finite-difference, upwind, dimensionally split methods. The magnetohydrodynamics algorithms consist of the constrained transport method to preserve the divergence-free property of […]
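For readers unfamiliar with such schemes, here is a hedged 1-D sketch of a first-order upwind finite-difference update for linear advection (du/dt + c du/dx = 0 with c > 0); FARGO3D's actual solver is 3-D, dimensionally split, and handles full MHD.

    // One explicit upwind step for 1-D linear advection with c > 0:
    // uNew[i] = u[i] - c * dt/dx * (u[i] - u[i-1]).
    __global__ void upwindStep(const float* u, float* uNew, int n,
                               float c, float dt, float dx)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i > 0 && i < n)              // skip the inflow boundary cell
            uNew[i] = u[i] - c * (dt / dx) * (u[i] - u[i - 1]);
    }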
Feb, 10

Performance Portable GPU Code Generation for Matrix Multiplication

Parallel accelerators such as GPUs are notoriously hard to program; exploiting their full performance potential is a job best left for ninja programmers. High-level programming languages coupled with optimizing compilers have been proposed to address this issue. However, they rely on device-specific heuristics or hard-coded library implementations to achieve good performance, resulting in […]
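One small, existing illustration of replacing a hard-coded launch heuristic with a device query is CUDA's occupancy API, sketched below; this is a generic technique, unrelated to the code generator the paper describes, and saxpy is just an example kernel.

    #include <cuda_runtime.h>

    __global__ void saxpy(float a, const float* x, float* y, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    // Ask the runtime for a good block size on whatever device is present,
    // instead of baking in a number tuned for one architecture.
    void launchSaxpy(float a, const float* x, float* y, int n)
    {
        int minGrid = 0, block = 0;
        cudaOccupancyMaxPotentialBlockSize(&minGrid, &block, saxpy, 0, 0);
        int grid = (n + block - 1) / block;
        saxpy<<<grid, block>>>(a, x, y, n);
    }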
Feb, 9

Guided Profiling for Auto-Tuning Array Layouts on GPUs

Auto-tuning for Graphics Processing Units (GPUs) has become very popular in recent years. It removes the necessity of hand-tuning GPU code, especially when a new hardware architecture is released. Our auto-tuner optimizes memory access patterns, a key aspect of exploiting the full performance of modern GPUs. As the memory hierarchy has historically changed […]
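The classic instance of such a layout choice is array-of-structures versus structure-of-arrays; the hedged CUDA sketch below shows why the latter usually wins on GPUs (consecutive threads touch consecutive addresses, so loads coalesce). It illustrates the problem space, not this paper's tuner.

    // AoS: thread i reads p[i].x, so neighboring threads touch addresses
    // two floats apart and the accesses are strided.
    struct ParticleAoS { float x; float y; };

    __global__ void shiftAoS(ParticleAoS* p, float d, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) p[i].x += d;          // strided: every other float
    }

    // SoA: the x components live in their own array, so neighboring threads
    // read consecutive floats and the loads coalesce.
    __global__ void shiftSoA(float* x, float d, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] += d;            // coalesced: consecutive floats
    }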
Feb, 8

Portable Programming Models for Heterogeneous Platforms

With the end of Dennard scaling and the emergence of dark silicon, hopes are pinned on heterogeneous architectures to achieve both application performance and energy efficiency. However, diversity in heterogeneous architectures poses severe programming challenges in terms of data layout, memory coherence, task partitioning, data distribution, and sharing of virtual addresses. Existing high-level programming languages […]
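As one existing data point on the "sharing of virtual addresses" challenge, CUDA unified memory gives host and device a single pointer, shown in the hedged sketch below; this illustrates a current mechanism, not the programming model the paper surveys or proposes.

    #include <cuda_runtime.h>

    __global__ void increment(int* data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1;
    }

    int main()
    {
        const int n = 1 << 20;
        int* data = nullptr;
        cudaMallocManaged(&data, n * sizeof(int));   // one pointer, both sides
        for (int i = 0; i < n; ++i) data[i] = i;     // CPU writes directly
        increment<<<(n + 255) / 256, 256>>>(data, n);
        cudaDeviceSynchronize();                     // CPU can now read results
        cudaFree(data);
        return 0;
    }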
Feb, 8

High performance high-order numerical methods: applications in ocean modeling

This thesis presents high-order numerical methods for time-dependent simulations of oceanic wave propagation on modern many-core hardware architectures. Simulation of waves such as tsunamis is challenging because of varying fluid depths, propagation across many regions, the need for high resolution near the shore, complex nonlinear wave phenomena, and the necessity of faster-than-real-time predictions. […]

* * *


HGPU group © 2010-2025 hgpu.org

All rights belong to the respective authors
