16044

Posts

Jun, 28

Accelerating High-Throughput Computing through OpenCL

As the computational trend diverges from standard CPU computing, to encompass GPUs and other accelerators, the need to integrate these unused resources within existing systems becomes apparent. This paper presents the implementation of a HTCondor pool with GPU execution capabilities through OpenCL. Implementation is discussed from both the system setup and the software design standpoint. […]
Jun, 28

GPU Based Real-Time Welding Simulation with Smoothed-Particle Hydrodynamics

Welding training is essential in the development of industrialization. A good welder will build robust workpieces that ensure the safety and stability of the product. However, training a welder requires lots of time and access professional welding equipment. Therefore, it is desirable to have a training system that is economical and easy to use. After […]
Jun, 22

Efficient and portable multi-tasking for heterogeneous systems

Modern computing systems comprise heterogeneous designs which combine multiple and diverse architectures on a single system. These designs provide potentials for high performance under reduced power requirements but require advanced resource management and workload scheduling across the available processors. Programmability frameworks, such as OpenCL and CUDA, enable resource management and workload scheduling on heterogeneous systems. […]
Jun, 22

Efficient and High-quality Sparse Graph Coloring on the GPU

Graph coloring has been broadly used to discover concurrency in parallel computing. To speedup graph coloring for large-scale datasets, parallel algorithms have been proposed to leverage modern GPUs. Existing GPU implementations either have limited performance or yield unsatisfactory coloring quality (too many colors assigned). We present a work-efficient parallel graph coloring implementation on GPUs with […]
Jun, 22

Tensor Contractions with Extended BLAS Kernels on CPU and GPU

Tensor contractions constitute a key computational ingredient of numerical multi-linear algebra. However, as the order and dimension of tensors grow, the time and space complexities of tensor-based computations grow quickly. Existing approaches for tensor contractions typically involves explicit copy and transpose operations. In this paper, we propose and evaluate a new BLAS-like primitive STRIDEDBATCHEDGEMM that […]
Jun, 22

CNNLab: a Novel Parallel Framework for Neural Networks using GPU and FPGA-a Practical Study with Trade-off Analysis

Designing and implementing efficient, provably correct parallel neural network processing is challenging. Existing high-level parallel abstractions like MapReduce are insufficiently expressive while low-level tools like MPI and Pthreads leave ML experts repeatedly solving the same design challenges. However, the diversity and large-scale data size have posed a significant challenge to construct a flexible and high-performance […]
Jun, 22

Soft GPGPUs for Embedded FPGAs: An Architectural Evaluation

We present a customizable soft architecture which allows for the execution of GPGPU code on an FPGA without the need to recompile the design. Issues related to scaling the overlay architecture to multiple GPGPU multiprocessors are considered along with application-class architectural optimizations. The overlay architecture is optimized for FPGA implementation to support efficient use of […]
Jun, 21

YodaNN: An Ultra-Low Power Convolutional Neural Network Accelerator Based on Binary Weights

Convolutional Neural Networks (CNNs) have revolutionized the world of image classification over the last few years, pushing the computer vision close beyond human accuracy. The required computational effort of CNNs today requires power-hungry parallel processors and GP-GPUs. Recent efforts in designing CNN Application-Specific Integrated Circuits (ASICs) and accelerators for System-On-Chip (SoC) integration have achieved very […]
Jun, 21

Combinatorial Optimization of Work Distribution on Heterogeneous Systems

We describe an approach that uses combinatorial optimization and machine learning to share the work between the host and device of heterogeneous computing systems such that the overall application execution time is minimized. We propose to use combinatorial optimization to search for the optimal system configuration in the given parameter space (such as, the number […]
Jun, 21

cltorch: a Hardware-Agnostic Backend for the Torch Deep Neural Network Library, Based on OpenCL

This paper presents cltorch, a hardware-agnostic backend for the Torch neural network framework. cltorch enables training of deep neural networks on GPUs from diverse hardware vendors, including AMD, NVIDIA, and Intel. cltorch contains sufficient implementation to run models such as AlexNet, VGG, Overfeat, and GoogleNet. It is written using the OpenCL language, a portable compute […]
Jun, 21

Acceleration of Statistical Detection of Zero-day Malware in the Memory Dump Using CUDA-enabled GPU Hardware

This paper focuses on the anticipatory enhancement of methods of detecting stealth software. Cyber security detection tools are insufficiently powerful to reveal the most recent cyber-attacks which use malware. In this paper, we will present first an idea of the highest stealth malware, as this is the most complicated scenario for detection because it combines […]
Jun, 21

A Parallel Algorithm for LZW Decompression, with GPU Implementation

The main contribution of this paper is to present a parallel algorithm for LZW decompression and to implement it in a CUDA-enabled GPU. Since sequential LZW decompression creates a dictionary table by reading codes in a compressed file one by one, its parallelization is not an easy task. We first present a parallel LZW decompression […]

* * *

* * *

HGPU group © 2010-2025 hgpu.org

All rights belong to the respective authors

Contact us:

contact@hpgu.org