high performance computing on graphics processing units: hgpu.org

Posts

Nov, 29

Semantic Segmentation of Colon Glands with Deep Convolutional Neural Networks and Total Variation Segmentation

Segmentation of histopathology sections is a ubiquitous requirement in digital pathology, and due to the large variability of biological tissue, machine learning techniques have shown superior performance over standard image processing methods. As part of the GlaS@MICCAI2015 colon gland segmentation challenge, we present a learning-based algorithm to segment glands in tissue of benign and malignant […]

CUDA

Nov, 29

A Problem-Based Learning Approach to GPU Computing

Compared to CPUs, modern GPUs exhibit a high ratio of computing performance per watt, and so current supercomputer designs often include multiple racks of GPUs in order to achieve high teraflop counts at minimal energy cost. GPU programming is thus becoming increasingly important, and yet it remains a challenging task. This paper describes a course […]

OpenCL

Nov, 29

Orchestrating Multiple Data-Parallel Kernels on Multiple Devices

Traditionally, programmers and software tools have focused on mapping a single data-parallel kernel onto a heterogeneous computing system consisting of multiple general-purpose processors (CPUs) and graphics processing units (GPUs). These methodologies break down as application complexity grows to contain multiple communicating data-parallel kernels. This paper introduces MKMD, an automatic system for mapping multiple kernels across […]

OpenCL

Nov, 25

Acceleration of Agent-Based Pandemic Modeling on Multiple GPUs

Epidemiology computation models are crucial for the assessment and control of public health crises. Agent-based simulations of pandemic influenza are useful for forecasting the spread of infectious disease in order to help public health policy makers during emergencies. In such emergencies, decisions are required for public health preparedness in cycles of less than a day, and […]

CUDA

Nov, 25

Efficient Resource Sharing Through GPU Virtualization on Accelerated High Performance Computing Systems

The High Performance Computing (HPC) field is witnessing a widespread adoption of Graphics Processing Units (GPUs) as co-processors for conventional homogeneous clusters. The adoption of the prevalent Single-Program Multiple-Data (SPMD) programming paradigm for GPU-based parallel processing brings the challenge of resource underutilization, given the asymmetrical processor/co-processor distribution. In other words, under SPMD, balanced CPU/GPU distribution […]

CUDA

Nov, 25

Optimization of a Machine Learning Algorithm on the Heterogeneous system using OpenCL

Today, no one disputes the importance of data in every industry, especially in the enterprise market. More recently, the key factor that decides the survival of a business is the management of its big data, which is defined by the 3V’s: Volume, Velocity, and Variety [1]. While the rate of data generation […]

OpenCL

Nov, 25

GPU-based Acceleration of Deep Convolutional Neural Networks on Mobile Platforms

Mobile applications running on wearable devices and smartphones can greatly benefit from accurate and scalable deep CNN-based machine learning algorithms. While mobile CPU performance does not match the intensive computational requirement of deep CNNs, the embedded GPU which already exists in many mobile platforms can be leveraged for acceleration of CNN computations on the local […]

Nov, 25

Pulsar Acceleration Searches on the GPU for the Square Kilometre Array

Pulsar acceleration searches are methods for recovering signals from radio telescopes that may otherwise be lost due to the effect of orbital acceleration in binary systems. The vast amount of data that will be produced by next generation instruments such as the Square Kilometre Array (SKA) necessitates real-time acceleration searches, which in turn requires the […]

CUDA

Nov, 24

Learning Representation for Scene Understanding: Epitomes, CRFs, and CNNs

Scene understanding, such as image classification and semantic image segmentation, has been a challenging problem in computer vision. The difficulties mainly come from the feature representation, i.e., how to find a good representation for images. Instead of improving over hand-crafted features such as SIFT or HoG, we focus on learning image representations by generative and […]

CUDA

Nov, 24

A parallel algorithm for the constrained shortest path problem on lattice graphs

We present a parallel algorithm for finding the shortest path whose total weight is smaller than a pre-determined value. The passage times over the edges are assumed to be positive integers. In each step, the processing elements do not analyze the entire graph; instead, they focus on a subset of vertices called active vertices. […]

OpenCL

Nov, 24

Compression of Deep Convolutional Neural Networks for Fast and Low Power Mobile Applications

Although the latest high-end smartphones have powerful CPUs and GPUs, running deeper convolutional neural networks (CNNs) for complex tasks such as ImageNet classification on mobile devices is challenging. To deploy deep CNNs on mobile devices, we present a simple and effective scheme to compress the entire CNN, which we call one-shot whole network compression. The […]

CUDA

Nov, 24

Comparative Study of Caffe, Neon, Theano, and Torch for Deep Learning

Deep learning methods have resulted in significant performance improvements in several application domains and as such several software frameworks have been developed to facilitate their implementation. This paper presents a comparative study of four deep learning frameworks, namely Caffe, Neon, Theano, and Torch, on three aspects: extensibility, hardware utilization, and speed. The study is performed […]

CUDA

Recent source codes

OpScanner

Atlas CLI: Machine Learning (ML) Lifecycle & Transparency Manager

transformers_tvm: Implementation of Encoder Decoder transformer on TVM

INT v.s. FP: A framework to compare low-bit integer and float-point formats

AutoDock-GPU: AutoDock for GPUs and other accelerators

NCCLX: collective communication framework

Tutoring LLM into a Better CUDA Optimizer

Adaptivity in AdaptiveCpp: Optimizing Performance by Leveraging Runtime Information During JIT-Compilation

Kernel Library for LLM Serving

Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs
