## Posts

Jan, 10

### linus: Conveniently explore, share, and present large-scale biological trajectory data from a web browser

In biology, we are often confronted with information-rich, large-scale trajectory data, but exploring and communicating patterns in such data is a cumbersome task. Ideally, the data should be wrapped with an interactive visualisation in one concise package that makes it straightforward to create and test hypotheses collaboratively. To address these challenges, we have developed […]

Jan, 10

### Advances in Electron Microscopy with Deep Learning

This doctoral thesis covers some of my advances in electron microscopy with deep learning. Highlights include a comprehensive review of deep learning in electron microscopy; large new electron microscopy datasets for machine learning, dataset search engines based on variational autoencoders, and automatic data clustering by t-distributed stochastic neighbour embedding; adaptive learning rate clipping to stabilize […]

Jan, 10

### Efficient Nearest-Neighbor Data Sharing in GPUs

Stencil codes (a.k.a. nearest-neighbor computations) are widely used in image processing, machine learning, and scientific applications. Stencil codes incur nearest-neighbor data exchange because the value of each point in the structured grid is calculated as a function of its value and the values of a subset of its nearest-neighbor points. When running on Graphics Processing […]
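The nearest-neighbor dependency described above can be sketched as a 5-point stencil on a 2D grid in plain Python. The function name `stencil_step` and the equal averaging weights are illustrative assumptions, not the paper's kernel. Each new interior value reads its own old value plus its four nearest neighbours, which is where the nearest-neighbor data exchange comes from.

```python
from typing import List

Grid = List[List[float]]


def stencil_step(grid: Grid) -> Grid:
    """One 5-point averaging sweep over the interior (hypothetical example kernel).

    Boundary points are left unchanged. Every update reads only old values,
    so the sweep is Jacobi-style rather than Gauss-Seidel.
    """
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]  # copy: results never overwrite inputs
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            out[i][j] = (
                grid[i][j]
                + grid[i - 1][j]  # north neighbour
                + grid[i + 1][j]  # south neighbour
                + grid[i][j - 1]  # west neighbour
                + grid[i][j + 1]  # east neighbour
            ) / 5.0
    return out


if __name__ == "__main__":
    g = [[0.0] * 4 for _ in range(4)]
    g[1][1] = 5.0
    for row in stencil_step(g):
        print(row)
```

A single nonzero point spreads to its interior neighbours after one sweep. When the grid is partitioned across parallel workers, the neighbour reads at partition edges are what must be shared.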

Jan, 10

### Compound Word Transformer: Learning to Compose Full-Song Music over Dynamic Directed Hypergraphs

To apply neural sequence models such as Transformers to music generation tasks, one has to represent a piece of music as a sequence of tokens drawn from a finite, predefined vocabulary. Such a vocabulary usually contains tokens of various types. For example, to describe a musical note, one needs separate tokens to […]

Jan, 10

### Hardware Acceleration of HPC Computational Flow Dynamics using HBM-enabled FPGAs

Scientific computing is at the core of many High-Performance Computing applications, including computational flow dynamics. Because simulating ever larger computational models is of utmost importance, hardware acceleration is receiving increased attention for its potential to maximize the performance of scientific computing. A Field-Programmable Gate Array is a reconfigurable hardware accelerator that is fully […]

Jan, 6

### 9th International Workshop on OpenCL and SYCL, 2021

IWOCL & SYCLcon is the annual gathering of the international community of OpenCL and SYCL developers, researchers, suppliers and Khronos Working Group members to share best practice, and to advance the use and evolution of the Open Computing Language (OpenCL) and the SYCL standard for C++ programming of heterogeneous platforms and their associated ecosystems. This […]

Jan, 3

### Design, Implementation and Test of Efficient GPU to GPU Communication Methods

Stencil codes are commonly used to solve a wide range of problems. On parallel heterogeneous systems with CPUs and GPUs, the domain is usually split and assigned to GPUs, where it is further divided into GPU blocks. The iterative distributed stencil computation consists of two steps: computation, and communication, in which the subdomains exchange boundary data, also called […]
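The compute/communicate cycle can be illustrated with a single-process Python sketch. A 1D array stands in for the domain. Each list in `subs` stands in for one GPU's subdomain, padded with one ghost (halo) cell per side. `exchange_halos` plays the role of the communication step, and `compute` applies a 3-point stencil. The 1D decomposition, the stencil, and all function names are illustrative assumptions. A real multi-GPU code would move boundary values between devices, for example with peer-to-peer copies or MPI, rather than between Python lists.

```python
from typing import List


def split_with_halos(data: List[float], parts: int) -> List[List[float]]:
    """Split data into `parts` contiguous chunks, each padded with two ghost cells."""
    n = len(data)
    bounds = [n * p // parts for p in range(parts + 1)]
    # Ghost cells start at 0.0. Outer ghosts stay 0.0 as a fixed global boundary.
    return [[0.0] + data[bounds[p]:bounds[p + 1]] + [0.0] for p in range(parts)]


def exchange_halos(subs: List[List[float]]) -> None:
    """Communication step: copy each neighbour's boundary value into our ghost cell."""
    for p in range(len(subs)):
        if p > 0:
            subs[p][0] = subs[p - 1][-2]  # last interior cell of left neighbour
        if p < len(subs) - 1:
            subs[p][-1] = subs[p + 1][1]  # first interior cell of right neighbour


def compute(subs: List[List[float]]) -> List[List[float]]:
    """Computation step: 3-point average over the interior cells of each subdomain."""
    new_subs = []
    for s in subs:
        inner = [(s[i - 1] + s[i] + s[i + 1]) / 3.0 for i in range(1, len(s) - 1)]
        new_subs.append([s[0]] + inner + [s[-1]])  # ghosts refreshed by next exchange
    return new_subs


def gather(subs: List[List[float]]) -> List[float]:
    """Reassemble the global domain from the interior cells of every subdomain."""
    return [x for s in subs for x in s[1:-1]]


def reference_step(data: List[float]) -> List[float]:
    """The same 3-point stencil on the undecomposed domain, for comparison."""
    padded = [0.0] + data + [0.0]
    return [(padded[i - 1] + padded[i] + padded[i + 1]) / 3.0
            for i in range(1, len(padded) - 1)]


if __name__ == "__main__":
    data = [float(i % 5) for i in range(10)]
    subs, ref = split_with_halos(data, 3), data[:]
    for _ in range(4):
        exchange_halos(subs)  # communicate
        subs = compute(subs)  # compute
        ref = reference_step(ref)
    print(gather(subs) == ref)
```

Skipping `exchange_halos` would leave stale ghost values, so the decomposed result would drift from the single-domain reference. This is why the communication step must run every iteration.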

Jan, 3

### Interactive Parallelization of C Programs in SAPFOR

SAPFOR (System For Automated Parallelization) is a software development suite focused on reducing the cost of manual program parallelization. SAPFOR produces parallel programs according to the high-level DVMH parallel programming model. SAPFOR relies on an implicitly parallel programming model, so it includes an automatic parallelizing compiler. On the other hand, it allows the user […]

Jan, 3

### I/O Lower Bounds for Auto-tuning of Convolutions in CNNs

Convolution is the most time-consuming part of the computation in convolutional neural networks (CNNs), which have achieved great success in numerous applications. Due to complex data dependencies and the growing number of model samples, convolution suffers from high data-movement (i.e., memory access) overhead. This work provides comprehensive analysis and […]

Jan, 3

### Thermal Safety and Real-Time Predictability on Heterogeneous Embedded SoC Platforms

Recent embedded systems are designed with high-performance Systems-on-Chip (SoCs) to satisfy the computational needs of complex applications widely used in real life, such as airplane controllers, autonomous vehicles, medical devices, drones, and hand-held devices. Modern SoCs integrate multi-core CPUs and various types of accelerators, including GPUs and DSPs. Uncontrolled heat dissipation is one of […]

Jan, 3

### Fast CUDA-Aware MPI Datatypes without Platform Support

MPI Derived Datatypes are an abstraction that simplifies handling of non-contiguous data in MPI applications. These datatypes are recursively constructed at runtime from primitive Named Types defined in the MPI standard. More recently, the development and deployment of CUDA-aware MPI implementations has encouraged distributed high-performance MPI codes to transition to GPUs. These implementations […]
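The ideas of non-contiguous data and recursive construction can be modelled in pure Python by representing a datatype as a tuple of element offsets. `vector` mimics the layout rule of `MPI_Type_vector`: `count` blocks of `blocklength` copies of the old type, with block starts `stride` old-type extents apart. `pack` gathers the selected elements into a contiguous list, which is one common way to handle a non-contiguous buffer before sending it. The model makes simplifying assumptions: element-granular offsets, positive strides, and no lower/upper bounds or alignment padding. All names here are hypothetical. This is neither the MPI API nor the paper's method.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Datatype:
    """Simplified type map: element offsets relative to the start of the type."""
    offsets: Tuple[int, ...]

    @property
    def extent(self) -> int:
        # Span from the first to the last element (no padding in this model).
        return max(self.offsets) - min(self.offsets) + 1


NAMED = Datatype((0,))  # stands in for one primitive named type, e.g. a double


def vector(count: int, blocklength: int, stride: int, old: Datatype) -> Datatype:
    """Build a vector type from `old`; stride is measured in extents of `old`."""
    offs: List[int] = []
    for b in range(count):
        for k in range(blocklength):
            base = (b * stride + k) * old.extent
            offs.extend(base + o for o in old.offsets)  # recursion via old's type map
    return Datatype(tuple(offs))


def pack(buf: Sequence[float], dtype: Datatype, start: int = 0) -> List[float]:
    """Gather the elements described by `dtype` (starting at `start`) contiguously."""
    return [buf[start + o] for o in dtype.offsets]


if __name__ == "__main__":
    matrix = [float(i) for i in range(16)]  # 4x4 matrix, row-major
    column = vector(4, 1, 4, NAMED)         # one column of the matrix
    print(pack(matrix, column, start=1))    # second column
    pair = vector(1, 2, 2, NAMED)           # two adjacent elements
    block = vector(2, 1, 2, pair)           # 2x2 sub-block, built from `pair`
    print(pack(matrix, block, start=5))
```

The `block` type is built from another derived type rather than directly from `NAMED`, which mirrors the recursive construction described in the abstract.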

Dec, 27

### When Machine Learning Meets Quantum Computers: A Case Study

As AI becomes increasingly democratized, machine learning, and neural networks in particular, has been applied to a wide range of applications. Depending on the application scenario, a neural network is accelerated on a tailored computing platform. The acceleration of neural networks on classical computing platforms, such as CPUs, GPUs, FPGAs, and ASICs, has been widely studied; […]