14248

Posts

Jul, 10

Characterizing and Optimizing Irregular Applications on Graphics Processing Units

In recent years, GPGPUs have experienced tremendous growth as general-purpose and high-throughput computing devices. Applications from various domains achieve significant speedups using GPGPUs. However, irregular applications do not perform well due to the mismatches between irregular problem structures and SIMD-like GPU architectures. The lack of in-depth characterization and quantifying the ways in which irregular applications […]
Jul, 10

Contributions to Music Semantic Analysis and Its Acceleration Techniques

Digitalized music production exploded in the past decade. Huge amount of data drives the development of effective and efficient methods for automatic music analysis and retrieval. This thesis focuses on performing semantic analysis of music, in particular mood and genre classification, with low level and mid level features since the mood and genre are among […]
Jul, 10

Towards Good Practices for Very Deep Two-Stream ConvNets

Deep convolutional networks have achieved great success for object recognition in still images. However, for action recognition in videos, the improvement of deep convolutional networks is not so evident. We argue that there are two reasons that could probably explain this result. First the current network architectures (e.g. Two-stream ConvNets) are relatively shallow compared with […]
Jul, 10

Multiple String Matching on a GPU using CUDAs

Multiple pattern matching algorithms are used to locate the occurrences of patterns from a finite pattern set in a large input string. Aho-Corasick, Set Horspool, Set Backward Oracle Matching, Wu-Manber and SOG, five of the most well known algorithms for multiple matching require an increased computing power, particularly in cases where large-size datasets must be […]
Jul, 8

Learning Better Encoding for Approximate Nearest Neighbor Search with Dictionary Annealing

We introduce a novel dictionary optimization method for high-dimensional vector quantization employed in approximate nearest neighbor (ANN) search. Vector quantization methods first seek a series of dictionaries, then approximate each vector by a sum of elements selected from these dictionaries. An optimal series of dictionaries should be mutually independent, and each dictionary should generate a […]
Jul, 8

Model-based optimization of MPDATA on Intel Xeon Phi through load imbalancing

Load balancing is a widely accepted technique for performance optimization of scientific applications on parallel architectures. Indeed, balanced applications do not waste processor cycles on waiting at points of synchronization and data exchange, maximizing this way the utilization of processors. In this paper, we challenge the universality of the load-balancing approach to optimization of the […]
Jul, 8

Autotuning OpenACC Work Distribution via Direct Search

OpenACC provides a high-productivity API for programming GPUs and similar accelerator devices. One of the last steps in tuning OpenACC programs is selecting values for the num_gangs and vector length clauses, which control how a parallel workload is distributed to an accelerator’s processing units. In this paper, we present OptACC, an autotuner that can assist […]
Jul, 8

Experiments on Parallel Training of Deep Neural Network using Model Averaging

In this work we apply model averaging to parallel training of deep neural network (DNN). Parallelization is done in a model averaging manner. Data is partitioned and distributed to different nodes for local model updates, and model averaging across nodes is done every few minibatches. We use multiple GPUs for data parallelization, and Message Passing […]
Jul, 8

Sorting and Permuting without Bank Conflicts on GPUs

In this paper, we look at the complexity of designing algorithms without any bank conflicts in the shared memory of Graphical Processing Units (GPUs). Given input of size $n$, $w$ processors and $w$ memory banks, we study three fundamental problems: sorting, permuting and $w$-way partitioning (defined as sorting an input containing exactly $n/w$ copies of […]
Jul, 6

High Performance Extreme Learning Machines: A Complete Toolbox for Big Data Applications

This work presents a complete approach to a successful utilization of a high performance Extreme Learning Machines (ELMs) Toolbox for Big Data. It summarizes recent advantages in algorithmic performance; gives a fresh view on the ELM solution in relation to the traditional linear algebraic performance; and reaps the latest software and hardware performance achievements. The […]
Jul, 6

Best bang for your buck: GPU nodes for GROMACS biomolecular simulations

The molecular dynamics simulation package GROMACS runs efficiently on a wide variety of hardware from commodity workstations to high performance computing clusters. Hardware features are well exploited with a combination of SIMD, multi-threading, and MPI-based SPMD/MPMD parallelism, while GPUs can be used as accelerators to compute interactions offloaded from the CPU. Here we evaluate which […]
Jul, 6

LTTng CLUST: A system-wide unified CPU and GPU tracing tool for OpenCL applications

As computation schemes evolve and many new tools become available to programmers to enhance the performance of their applications, many programmers started to look towards highly parallel platforms such as Graphical Processing Unit (GPU). Offloading computations that can take advantage of the architecture of the GPU is a technique that has proven fruitful in recent […]

* * *

* * *

HGPU group © 2010-2025 hgpu.org

All rights belong to the respective authors

Contact us:

contact@hpgu.org