17681

Posts

Oct, 21

Revisiting the Case of ARM SoCs in High-Performance Computing Clusters

Over the course of the past decade, the explosive popularity of embedded devices such as smartphones and tablets have given rise to ARM SoCs, whose characteristically low power consumption have made them ideal for these types of embedded devices. Recent maturation in the ARM SoC market, which has seen the advent of more powerful 64-bit […]
Oct, 21

Parallel Matching and Clustering Algorithms on GPUs

The main focus of this thesis is on developing efficient algorithms on GPUs for certain matching and clustering problems. Through extensive experiments we show that sparse and unstructured problems can benefit greatly from using GPUs as long as the algorithms are carefully designed. Even though none of the presented algorithms are fundamentally new, they still […]
Oct, 21

Computation of gray-level co-occurrence matrix based on CUDA and its optimization

As in various fields like scientific research and industrial application, the computation time optimization is becoming a task that is of increasing importance because of its highly parallel architecture. The graphics processing unit is regarded as a powerful engine for application programs that demand fairly high computation capabilities. Based on this, an algorithm was introduced […]
Oct, 21

How to distribute most efficiently a computation intensive calculation on an Android device to external compute units with an Android API

Is transferring computation intensive calculations to external compute-units the next trend? This master’s thesis researches if it is worth the effort to transfer a matrix multiplication from an Android phone to a System-on-Chip (SoC), using Bluetooth or WebSocket as communication protocols. The SoC solution used in this work is an Intel Altera Cyclone V based […]
Oct, 15

Accelerating Genomics Research with OpenCL and FPGAs

With the rapid decrease in gene sequencing costs due to the emergence of second-generation sequencing equipment, the availability of genome sequence data is increasing dramatically. The ability to correlate the variations among genomes is enabling advances in a wide range of medical research and personalized care. Because each human genome comprises more than three billion […]
Oct, 15

Flexible FPGA design for FDTD using OpenCL

Compared to classical HDL designs, generating FPGA with high-level synthesis from an OpenCL specification promises easier exploration of different design alternatives and, through ready-to-use infrastructure and common abstractions for host and memory interfaces, easier portability between different FPGA families. In this work, we evaluate the extent of this promise. To this end, we present a […]
Oct, 15

Toward Performance Portability for CPUs and GPUs Through Algorithmic Compositions

The diversity of microarchitecture designs in heterogeneous computing systems allows programs to achieve high performance and energy efficiency, but results in substantial software redevelopment cost for each type or generation of hardware. To mitigate this cost, a performance portable programming system is required. This work presents my solution to the performance portability problem. I argue […]
Oct, 15

Synkhronos: a Multi-GPU Theano Extension for Data Parallelism

We present Synkhronos, an extension to Theano for multi-GPU computations leveraging data parallelism. Our framework provides automated execution and synchronization across devices, allowing users to continue to write serial programs without risk of race conditions. The NVIDIA Collective Communication Library is used for high-bandwidth inter-GPU communication. Further enhancements to the Theano function interface include input […]
Oct, 15

SoAx: A generic C++ Structure of Arrays for handling Particles in HPC Codes

The numerical study of physical problems often require integrating the dynamics of a large number of particles evolving according to a given set of equations. Particles are characterized by the information they are carrying such as an identity, a position other. There are generally speaking two different possibilities for handling particles in high performance computing […]
Oct, 3

Energy efficiency of finite difference algorithms on multicore CPUs, GPUs, and Intel Xeon Phi processors

In addition to hardware wall-time restrictions commonly seen in high-performance computing systems, it is likely that future systems will also be constrained by energy budgets. In the present work, finite difference algorithms of varying computational and memory intensity are evaluated with respect to both energy efficiency and runtime on an Intel Ivy Bridge CPU node, […]
Oct, 3

FPGA implementation of a Convolutional Neural Network for "Wake up word" detection

The popularity of machine learning has increased dramatically in the last years and the possible applications varies from web search, speech recognition, object detection, etc. A big part of this development is due to the use of Convolutional Neural Networks (CNNs), where high performance Graphics Processing Units (GPUs) has been the most popular device. This […]
Oct, 3

An Efficient Load Balancing Method for Tree Algorithms

Nowadays, multiprocessing is mainstream with exponentially increasing number of processors. Load balancing is, therefore, a critical operation for the efficient execution of parallel algorithms. In this paper we consider the fundamental class of tree-based algorithms that are notoriously irregular, and hard to load-balance with existing static techniques. We propose a hybrid load balancing method using […]
Page 5 of 935« First...34567...102030...Last »

Recent source codes

* * *

* * *

HGPU group © 2010-2017 hgpu.org

All rights belong to the respective authors

Contact us: