Posts
Sep, 8
Compilers for Portable Programming of Heterogeneous Parallel & Approximate Computing Systems
Programming heterogeneous systems such as the System-on-chip (SoC) processors in modern mobile devices can be extremely complex because a single system may include multiple different parallelism models, instruction sets, memory hierarchies, and systems use different combinations of these features. This is further complicated by software and hardware approximate computing optimizations. Different compute units on an […]
Sep, 8
Neural Network Inference on Mobile SoCs
The ever-increasing demand from mobile Machine Learning (ML) applications calls for evermore powerful on-chip computing resources. Mobile devices are empowered with Heterogeneous Multi-Processor Systems on Chips (HMPSoCs) to process ML workloads such as Convolutional Neural Network (CNN) inference. HMPSoCs house several different types of ML capable components on-die, such as CPU, GPU, and accelerators. These […]
Sep, 1
Compositional Deep Learning in Futhark
We present a design pattern for composing deep learning networks in a typed, higher-order fashion. The exposed library functions are generically typed and the composition structure allows for networks to be trained (using backpropagation) and for trained networks to be used for predicting new results (using forward-propagation). Individual layers in a network can take different […]
Sep, 1
Demystifying the MLPerf Benchmark Suite
MLPerf, an emerging machine learning benchmark suite strives to cover a broad range of applications of machine learning. We present a study on its characteristics and how the MLPerf benchmarks differ from some of the previous deep learning benchmarks like DAWNBench and DeepBench. We find that application benchmarks such as MLPerf (although rich in kernels) […]
Sep, 1
Visual Performance Analysis of Memory Behavior in a Task-Based Runtime on Hybrid Platforms
Programming parallel applications for heterogeneous HPC platforms is much more straightforward when using the task-based programming paradigm. The simplicity exists because a runtime takes care of many activities usually carried out by the application developer, such as task mapping, load balancing, and memory management operations. In this paper, we present a visualization-based performance analysis methodology […]
Sep, 1
Automated Architecture Design for Deep Neural Networks
Machine learning has made tremendous progress in recent years and received large amounts of public attention. Though we are still far from designing a full artificially intelligent agent, machine learning has brought us many applications in which computers solve human learning tasks remarkably well. Much of this progress comes from a recent trend within machine […]
Sep, 1
Survey and Benchmarking of Machine Learning Accelerators
Advances in multicore processors and accelerators have opened the flood gates to greater exploration and application of machine learning techniques to a variety of applications. These advances, along with breakdowns of several trends including Moore’s Law, have prompted an explosion of processors and accelerators that promise even greater computational and machine learning capabilities. These processors […]
Aug, 25
Position-Dependent Arrays and Their Application for High Performance Code Generation
Modern parallel hardware promises unprecedented performance, for the gifted few experts who can program it correctly. Code generators from high-level languages provide an attractive alternative, promising to deliver high performance automatically. Existing projects such as Accelerate, Futhark, Halide, or Lift show that this approach is feasible. Unfortunately, existing efforts focus on computations over tensors: regularly […]
Aug, 25
stdgpu: Efficient STL-like Data Structures on the GPU
Tremendous advances in parallel computing and graphics hardware opened up several novel real-time GPU applications in the fields of computer vision, computer graphics as well as augmented reality (AR) and virtual reality (VR). Although these applications built upon established opensource frameworks that provide highly optimized algorithms, they often come with custom self-written data structures to […]
Aug, 25
Automatic Compiler Based FPGA Accelerator for CNN Training
Training of convolutional neural networks (CNNs)on embedded platforms to support on-device learning is earning vital importance in recent days. Designing flexible training hard-ware is much more challenging than inference hardware, due to design complexity and large computation/memory requirement. In this work, we present an automatic compiler-based FPGA accelerator with 16-bit fixed-point precision for complete CNNtraining, […]
Aug, 25
On-The-Fly Parallel Data Shuffling for Graph Processing on OpenCL-based FPGAs
Graph processing has attracted much attention recently due to its popularity in many big data analytic applications. With high performance and energy efficiency, FPGAs can be an attractive architecture for graph processing. A number of techniques such as caching using block RAMs (BRAMs) to reduce random accesses of global memory and multiple processing element (PE) […]
Aug, 25
Memory-Efficient Object-Oriented Programming on GPUs
Object-oriented programming is often regarded as too inefficient for high-performance computing (HPC), despite the fact that many important HPC problems have an inherent object structure. Our goal is to bring efficient, object-oriented programming to massively parallel SIMD architectures, especially GPUs. In this thesis, we develop various techniques for optimizing object-oriented GPU code. Most notably, we […]