14253

Posts

Jul, 13

Evaluating the capabilities of the Xeon Phi platform in the context of software-only, thread-level speculation

Intel Xeon Phi accelerators are one of the newest devices used in the field of parallel computing. However, there are comparatively few studies concerning their performance when using most of the existing parallelization techniques. One of them is thread-level speculation, a technique that optimistically tries to extract parallelism of loops without the need of a […]
Jul, 13

Optimization, Specification and Verification of the Prefix Sum Program in an OpenCL Environment

The Prefix Sum is an algorithm used as a building block for various other algorithms, for example radix sort, quicksort and lexically comparing strings. Implementing the Prefix Sum algorithm on the CPU is trivial, but a parallel approach with OpenCL is more complicated. An implementation in OpenCL has been made, and optimized to minimize branch […]
Jul, 13

PLB-HeC: A Profile-based Load-Balancing algorithm for Heterogeneous CPU-GPU Clusters

The use of GPU clusters for scientific applications in areas such as physics, chemistry and bioinformatics is becoming more widespread. These clusters frequently have different types of processing devices, such as CPUs and GPUs, which can themselves be heterogeneous. To use these devices in an efficient manner, it is crucial to find the right amount […]
Jul, 13

A Study of Data Partitioning on OpenCL-based FPGAs

A lot of research efforts have been devoted to accelerating relational database applications on FPGAs, due to their high energy efficiency and high throughput. Most of the existing studies are based on hardware description languages (HDLs). Recently, FPGA vendors have started to develop OpenCL SDKs for much better programmability. In this paper, we investigate the […]
Jul, 10

Characterizing and Optimizing Irregular Applications on Graphics Processing Units

In recent years, GPGPUs have experienced tremendous growth as general-purpose and high-throughput computing devices. Applications from various domains achieve significant speedups using GPGPUs. However, irregular applications do not perform well due to the mismatches between irregular problem structures and SIMD-like GPU architectures. The lack of in-depth characterization and quantifying the ways in which irregular applications […]
Jul, 10

Contributions to Music Semantic Analysis and Its Acceleration Techniques

Digitalized music production exploded in the past decade. Huge amount of data drives the development of effective and efficient methods for automatic music analysis and retrieval. This thesis focuses on performing semantic analysis of music, in particular mood and genre classification, with low level and mid level features since the mood and genre are among […]
Jul, 10

Many-Core Compiler Fuzzing

We address the compiler correctness problem for many-core systems through novel applications of fuzz testing to OpenCL compilers. Focusing on two methods from prior work, random differential testing and testing via equivalence modulo inputs (EMI), we present several strategies for random generation of deterministic, communicating OpenCL kernels, and an injection mechanism that allows EMI testing […]
Jul, 10

Towards Good Practices for Very Deep Two-Stream ConvNets

Deep convolutional networks have achieved great success for object recognition in still images. However, for action recognition in videos, the improvement of deep convolutional networks is not so evident. We argue that there are two reasons that could probably explain this result. First the current network architectures (e.g. Two-stream ConvNets) are relatively shallow compared with […]
Jul, 10

Multiple String Matching on a GPU using CUDAs

Multiple pattern matching algorithms are used to locate the occurrences of patterns from a finite pattern set in a large input string. Aho-Corasick, Set Horspool, Set Backward Oracle Matching, Wu-Manber and SOG, five of the most well known algorithms for multiple matching require an increased computing power, particularly in cases where large-size datasets must be […]
Jul, 8

Learning Better Encoding for Approximate Nearest Neighbor Search with Dictionary Annealing

We introduce a novel dictionary optimization method for high-dimensional vector quantization employed in approximate nearest neighbor (ANN) search. Vector quantization methods first seek a series of dictionaries, then approximate each vector by a sum of elements selected from these dictionaries. An optimal series of dictionaries should be mutually independent, and each dictionary should generate a […]
Jul, 8

Model-based optimization of MPDATA on Intel Xeon Phi through load imbalancing

Load balancing is a widely accepted technique for performance optimization of scientific applications on parallel architectures. Indeed, balanced applications do not waste processor cycles on waiting at points of synchronization and data exchange, maximizing this way the utilization of processors. In this paper, we challenge the universality of the load-balancing approach to optimization of the […]
Jul, 8

Autotuning OpenACC Work Distribution via Direct Search

OpenACC provides a high-productivity API for programming GPUs and similar accelerator devices. One of the last steps in tuning OpenACC programs is selecting values for the num_gangs and vector length clauses, which control how a parallel workload is distributed to an accelerator’s processing units. In this paper, we present OptACC, an autotuner that can assist […]

* * *

* * *

HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors

Contact us: