Posts
Sep, 27
GPU-TLS: An Efficient Runtime for Speculative Loop Parallelization on GPUs
Recently, GPUs have emerged as an important parallel platform for general-purpose applications, in both HPC and cloud environments. Because of their special execution model, developing programs for GPUs is difficult even with the recent introduction of high-level languages like CUDA and OpenCL. To ease the programming effort, some research has proposed automatically generating parallel […]
Sep, 27
Java with Auto-Parallelization on Graphics Coprocessing Architecture
GPU-based many-core accelerators have gained a footing in supercomputing. Their widespread adoption, however, hinges on better parallelization and load scheduling techniques to utilize the hybrid system of CPU and GPU cores easily and efficiently. This paper introduces a new user-friendly compiler framework and runtime system, dubbed Japonica, to help Java applications harness the full power […]
Sep, 27
Multi-Scale, Multi-Level, Heterogeneous Features Extraction and Classification of Volumetric Medical Images
This paper articulates a novel method for the heterogeneous feature extraction and classification directly on volumetric images, which covers multi-scale point feature, multi-scale surface feature, multi-level curve feature, and blob feature. To tackle the challenge of complex volumetric inner structure and diverse feature forms, our technical solution hinges upon the integrated approach of locally-defined diffusion […]
Sep, 27
Trellis: Portability Across Architectures with a High-level Framework
The increasing computational needs of parallel applications inevitably require portability across parallel architectures, which now include heterogeneous processing resources, such as CPUs and GPUs, and multiple SIMD/SIMT widths. However, the lack of a common parallel programming paradigm that provides predictable, near-optimal performance on each resource leads to the use of low-level frameworks with architecture-specific optimizations, […]
Sep, 25
OpenCL Task Partitioning in the Presence of GPU Contention
Heterogeneous multi- and many-core systems are increasingly prevalent in the desktop and mobile domains. On these systems it is common for programs to compete with co-running programs for resources. While multi-task scheduling for CPUs is a well-studied area, how to partition and map computing tasks onto the heterogeneous system in the presence of GPU contention […]
Sep, 25
Large Graphs on multi-GPUs
We studied the problem of developing an efficient BFS algorithm to explore large graphs having billions of nodes and edges. The size of the problem requires a parallel computing architecture. We proposed a new algorithm that performs a distributed BFS and the corresponding implementation on multi-GPU clusters. As far as we know, this is the […]
Sep, 25
Automatic Software Synthesis from High-Level ForSyDe Models Targeting Massively Parallel Processors
In the past decade we have witnessed an abrupt shift to parallel computing, driven by the increasing demand for performance and functionality that can no longer be satisfied by conventional paradigms. As a consequence, the abstraction gap between the applications and the underlying hardware increased, prompting both industry and academia to pursue several research directions. This […]
Sep, 25
Fast k-NNG construction with GPU-based quick multi-select
In this paper we describe a new brute force algorithm for building the k-Nearest Neighbor Graph (k-NNG). The k-NNG algorithm has many applications in areas such as machine learning, bioinformatics, and clustering analysis. While there are very efficient algorithms for data of low dimensions, for high dimensional data the brute force search is the best […]
Sep, 25
The density matrix renormalization group algorithm on kilo-processor architectures: implementation and trade-offs
In the numerical analysis of strongly correlated quantum lattice models one of the leading algorithms developed to balance the size of the effective Hilbert space and the accuracy of the simulation is the density matrix renormalization group (DMRG) algorithm, in which the run-time is dominated by the iterative diagonalization of the Hamilton operator. As the […]
Sep, 24
Evaluation of autoparallelization toolkits for commodity graphics hardware
In this paper we evaluate the performance of the OpenACC and Mint toolkits against C and CUDA implementations of the standard PolyBench test suite. Our analysis reveals that performance is similar in many cases, but that a certain set of code constructs impedes the ability of Mint to generate optimal code. We then present some […]
Sep, 24
Pipeline strategies to accelerate range query processing on a multi-GPU environment
Nowadays, similarity search is becoming a field of increasing interest because these kinds of methods can be applied to different areas in computer science and engineering, such as voice and image recognition, text retrieval, and many others. However, when processing large volumes of data, query response time can be quite high. In this case, it […]
Sep, 24
Realtime Deformation of Constrained Meshes Using GPU
Constrained meshes play an important role in freeform architectural design, as they can represent panel layouts on freeform surfaces. It is challenging to perform realtime manipulation on such meshes, because all constraints need to be respected during the deformation while the shape quality needs to be maintained. This usually leads to nonlinear constrained optimization problems, […]