Posts
Jul, 3
Optimizing the Performance of Parallel and Concurrent Applications Based on Asynchronous Many-Task Runtimes
Nowadays, High-Performance Computing (HPC) scientific applications often face performance, scalability, portability, and efficiency challenges when running on heterogeneous supercomputers. For years, supercomputer architectures have been changing rapidly and growing more complex, and these challenges will become even harder as we enter the exascale era, where computers will exceed one quintillion calculations […]
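The abstract is cut off here, but the core idea of an asynchronous many-task (AMT) runtime is to decompose a computation into many small tasks with explicit dependencies and let a scheduler overlap independent work. A minimal sketch of that programming model using Python's concurrent.futures; real AMT runtimes such as HPX, Charm++, or Legion add distributed scheduling, work stealing, and locality control on top:

    # Tasks are launched eagerly; blocking happens only where a result is needed,
    # so independent tasks overlap. Illustrates the AMT programming model only.
    from concurrent.futures import ThreadPoolExecutor

    def stage_a(x):
        return x * 2          # independent task

    def stage_b(x):
        return x + 10         # independent task, can run concurrently with stage_a

    with ThreadPoolExecutor() as pool:
        fa = pool.submit(stage_a, 3)          # launched immediately
        fb = pool.submit(stage_b, 3)          # overlaps with stage_a
        result = fa.result() + fb.result()    # dependency: wait only here

    print(result)  # 19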
Jul, 3
TPU-KNN: K Nearest Neighbor Search at Peak FLOP/s
This paper presents a novel nearest neighbor search algorithm that achieves peak performance on the TPU (Google Tensor Processing Unit), outperforming state-of-the-art GPU algorithms at a similar level of recall. The design of the proposed algorithm is motivated by an accurate accelerator performance model that takes into account both memory and instruction bottlenecks. Our algorithm comes with an […]
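The snippet is truncated, but the standard way such algorithms stay compute-bound on an accelerator is to reduce distance computation to a dense matrix multiply and avoid a full sort with a partial top-k. A hedged NumPy sketch of that pattern (the sizes and names are illustrative, not from the paper):

    # ||q - x||^2 = ||q||^2 - 2 q.x + ||x||^2; the per-query ||q||^2 term is
    # constant, so ranking needs only -2 q.x + ||x||^2, i.e. one big matmul.
    import numpy as np

    rng = np.random.default_rng(0)
    db = rng.standard_normal((10_000, 128)).astype(np.float32)   # database points
    queries = rng.standard_normal((32, 128)).astype(np.float32)  # query batch
    k = 10

    db_sq = (db * db).sum(axis=1)            # ||x||^2, precomputed once
    scores = -2.0 * queries @ db.T + db_sq   # (num_queries, num_db)
    # argpartition is an O(n) partial top-k, the same spirit as the approximate
    # top-k the paper uses to dodge the sorting bottleneck.
    idx = np.argpartition(scores, k, axis=1)[:, :k]
    print(idx.shape)  # (32, 10)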
Jun, 26
An experimental study of group-by and aggregation on CPU-GPU processors
Hash-based group-by and aggregation is a fundamental operator in database systems. Modern discrete GPUs (graphics processing units) have been explored as accelerators for it, but data transfer over the PCIe (Peripheral Component Interconnect Express) bus can erode the gains. On recent architectures, the GPU and the CPU (central processing unit) are built into the same […]
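For readers unfamiliar with the operator itself: hash-based group-by keeps one hash-table bucket per group key and accumulates the aggregate on each probe. A minimal single-threaded Python sketch of the baseline that GPU implementations parallelize:

    # One expected-O(1) hash-table probe and update per input row.
    from collections import defaultdict

    rows = [("us", 3), ("eu", 5), ("us", 7), ("ap", 1), ("eu", 2)]

    sums = defaultdict(int)      # hash table: group key -> running SUM aggregate
    for key, value in rows:
        sums[key] += value

    print(dict(sums))  # {'us': 10, 'eu': 7, 'ap': 1}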
Jun, 26
SnuHPL: high performance LINPACK for heterogeneous GPUs
These days, it is typical for a large-scale cluster system to contain several different kinds of GPUs. However, HPL (High-Performance LINPACK), the de facto standard LINPACK implementation for evaluating the performance of a cluster system, was originally designed to work only on homogeneous CPU-only systems. In this paper, we develop SnuHPL, an optimized HPL for clusters of […]
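The abstract is truncated before the method, but any HPL for heterogeneous GPUs has to divide the matrix unevenly across devices of different speeds. A toy sketch of that load-balancing flavor, assigning block-columns in proportion to measured per-GPU throughput; this is a guess at the general problem, not SnuHPL's actual scheme:

    def partition_blocks(num_blocks, gflops_per_gpu):
        # Ideal fractional share per GPU, rounded down; the remainder goes to
        # the fastest devices so every block is assigned exactly once.
        total = sum(gflops_per_gpu)
        shares = [int(num_blocks * g / total) for g in gflops_per_gpu]
        leftover = num_blocks - sum(shares)
        for i in sorted(range(len(shares)), key=lambda i: -gflops_per_gpu[i]):
            if leftover == 0:
                break
            shares[i] += 1
            leftover -= 1
        return shares

    # One fast GPU plus two slower ones (hypothetical TFLOP/s numbers):
    print(partition_blocks(100, [19.5, 7.8, 7.8]))  # [56, 22, 22]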
Jun, 26
tntorch: Tensor Network Learning with PyTorch
We present tntorch, a tensor learning framework that supports multiple decompositions (including Candecomp/Parafac, Tucker, and Tensor Train) under a unified interface. With our library, the user can learn and handle low-rank tensors with automatic differentiation, seamless GPU support, and the convenience of PyTorch’s API. Besides decomposition algorithms, tntorch implements differentiable tensor algebra, rank truncation, cross-approximation, […]
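To make the Tensor Train format concrete: it stores one small 3-D core per tensor mode and reconstructs the full tensor by contracting the cores in a chain. A plain-PyTorch sketch of the format itself (not tntorch's actual API):

    import torch

    # A rank-(1, 2, 2, 1) TT representation of a 4 x 5 x 6 tensor:
    cores = [
        torch.randn(1, 4, 2),   # core 1: (r0, n1, r1)
        torch.randn(2, 5, 2),   # core 2: (r1, n2, r2)
        torch.randn(2, 6, 1),   # core 3: (r2, n3, r3)
    ]

    def tt_full(cores):
        # Contract the chain of cores; the result has one mode per core.
        out = cores[0]
        for core in cores[1:]:
            out = torch.einsum("...a,abc->...bc", out, core)
        return out.squeeze(0).squeeze(-1)

    print(tt_full(cores).shape)  # torch.Size([4, 5, 6])
    # Storage: 8 + 20 + 12 = 40 numbers instead of 4 * 5 * 6 = 120.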
Jun, 26
Deep Learning Models on CPUs: A Methodology for Efficient Training
GPUs have been favored for training deep learning models due to their highly parallel architecture. As a result, most studies on training optimization focus on GPUs. There is often a trade-off, however, between cost and efficiency when choosing hardware for training. In particular, CPU servers can be beneficial if […]
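One concrete knob that any CPU-training methodology has to tune is the intra-op thread count, which PyTorch exposes directly. A small illustrative micro-benchmark; this is only one of many factors such a methodology would cover, and the right setting depends on core count and NUMA layout:

    import time
    import torch

    x = torch.randn(2048, 2048)
    y = torch.randn(2048, 2048)

    for threads in (1, 2, 4):
        torch.set_num_threads(threads)       # intra-op parallelism knob
        start = time.perf_counter()
        for _ in range(10):
            _ = x @ y                        # GEMM, the dominant training kernel
        print(f"{threads} thread(s): {time.perf_counter() - start:.3f}s")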
Jun, 26
Open-source FPGA-ML codesign for the MLPerf Tiny Benchmark
We present our development experience and recent results for the MLPerf Tiny Inference Benchmark on field-programmable gate array (FPGA) platforms. We use the open-source hls4ml and FINN workflows, which aim to democratize AI-hardware codesign of optimized neural networks on FPGAs. We present the design and implementation process for the keyword spotting, anomaly detection, and image […]
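For a sense of what the hls4ml side of this codesign flow looks like, here is the conversion entry point from hls4ml's documented tutorials; the toy model, output directory, and FPGA part number below are placeholders, not the benchmark's configuration:

    import hls4ml
    from tensorflow import keras

    # Toy stand-in for a trained keyword-spotting/anomaly-detection model.
    model = keras.Sequential([
        keras.layers.Dense(64, activation="relu", input_shape=(16,)),
        keras.layers.Dense(10, activation="softmax"),
    ])

    # Derive a precision/reuse configuration, then emit an HLS project.
    config = hls4ml.utils.config_from_keras_model(model, granularity="model")
    hls_model = hls4ml.converters.convert_from_keras_model(
        model,
        hls_config=config,
        output_dir="hls4ml_prj",         # placeholder path
        part="xcu250-figd2104-2L-e",     # placeholder FPGA part
    )
    hls_model.compile()  # builds the C-simulation library locally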
Jun, 19
MapReduce for Counting Word Frequencies with MPI and GPUs
In this project, the goal was to use the Julia programming language and parallelization to write a fast MapReduce algorithm for counting word frequencies across large numbers of documents. We first implement the word-frequency counter on a CPU using two processes with MPI. Then, we create another implementation, but on a GPU […]
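The map/reduce structure in question is simple to state: each worker maps a document to a local word histogram, and the histograms are then merged in a reduce step. A Python stand-in for that structure (the project itself uses Julia with MPI and a GPU):

    from collections import Counter
    from multiprocessing import Pool

    def map_count(document):
        return Counter(document.lower().split())   # local per-document counts

    documents = [
        "the quick brown fox",
        "the lazy dog",
        "the fox and the dog",
    ]

    if __name__ == "__main__":
        with Pool() as pool:
            partials = pool.map(map_count, documents)   # map phase, in parallel
        totals = sum(partials, Counter())               # reduce phase: merge
        print(totals.most_common(3))  # [('the', 4), ('fox', 2), ('dog', 2)]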
Jun, 19
PILC: Practical Image Lossless Compression with an End-to-end GPU Oriented Neural Framework
Generative-model-based lossless image compression algorithms have seen great success in improving compression ratios. However, most of them achieve throughput below 1 MB/s even with the most advanced AI accelerator chips, preventing their use in most real-world applications, which often require around 100 MB/s. In this paper, we propose PILC, an end-to-end […]
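The reason a generative model helps at all: an entropy coder can represent a symbol s in about -log2 p(s) bits, so sharper predictions mean shorter codes, and the engineering problem is making the model plus coder fast. A toy illustration of the code-length accounting only (the distribution and message are made up):

    import math

    model_probs = {"a": 0.7, "b": 0.2, "c": 0.1}   # toy predictive distribution
    message = "aaabac"

    bits = sum(-math.log2(model_probs[s]) for s in message)
    print(f"{bits:.2f} bits")   # ~7.70 bits, vs ~9.51 for a uniform code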
Jun, 19
Securing GPU via Region-based Bounds Checking
Graphics processing units (GPUs) have become essential general-purpose computing platforms for accelerating a wide range of workloads, such as deep learning, scientific, and high-performance computing (HPC) applications. However, recent memory-corruption attacks, such as buffer overflows, have exposed security vulnerabilities in GPUs. We demonstrate that out-of-bounds writes are reproducible on an NVIDIA GPU, which can enable […]
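The abstract cuts off before the defense, but the title describes it: validate every access against the bounds of the allocation region it falls in. A toy software model of region-based bounds checking; the paper's mechanism lives in GPU hardware and compiler support, so this only illustrates the concept:

    import bisect

    class RegionTable:
        def __init__(self):
            self.bases, self.sizes = [], []    # allocation regions, sorted by base

        def register(self, base, size):
            i = bisect.bisect(self.bases, base)
            self.bases.insert(i, base)
            self.sizes.insert(i, size)

        def check(self, addr):
            # Find the region with the largest base <= addr, then bounds-test.
            i = bisect.bisect(self.bases, addr) - 1
            return i >= 0 and addr < self.bases[i] + self.sizes[i]

    regions = RegionTable()
    regions.register(0x1000, 256)
    print(regions.check(0x10FF))  # True: last byte inside the region
    print(regions.check(0x1100))  # False: one past the end, out of bounds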
Jun, 19
CoDL: Efficient CPU-GPU Co-execution for Deep Learning Inference on Mobile Devices
Concurrent inference execution on heterogeneous processors is critical to improving the performance of increasingly heavy deep learning (DL) models. However, available inference frameworks can only use one processor at a time, or achieve little speedup from concurrent execution compared to using a single processor. This is due to the challenges of 1) reducing data-sharing overhead, […]
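The shape of CPU-GPU co-execution is easy to sketch even though the hard parts (choosing the split, hiding the data-sharing cost) are the paper's contribution: partition one operator's input, run both halves concurrently, and merge. A thread-based Python stand-in with a placeholder kernel, not CoDL's partitioner:

    from concurrent.futures import ThreadPoolExecutor
    import numpy as np

    def run_op(chunk, device):
        return np.tanh(chunk)   # placeholder for a DL operator on `device`

    batch = np.random.randn(64, 128).astype(np.float32)
    split = 40   # rows for the "GPU"; the ratio would come from latency profiling

    with ThreadPoolExecutor(max_workers=2) as pool:
        f_gpu = pool.submit(run_op, batch[:split], "gpu")
        f_cpu = pool.submit(run_op, batch[split:], "cpu")   # runs concurrently
        out = np.concatenate([f_gpu.result(), f_cpu.result()])

    print(out.shape)  # (64, 128)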
Jun, 19
CuPBoP: CUDA for Parallelized and Broad-range Processors
CUDA is one of the most popular choices for GPU programming, but it can only execute on NVIDIA GPUs. Executing CUDA on non-NVIDIA devices would not only benefit the hardware community but also enable data-parallel computation on heterogeneous systems. To make CUDA programs portable, some researchers have proposed source-to-source translators that translate CUDA to […]
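The essence of running CUDA code on hardware without CUDA support is re-expressing the SIMT model, a kernel body executed once per thread index, as ordinary loops. A toy Python emulation of a 1-D grid to show the correspondence (CuPBoP itself works at the compiler level, not like this):

    def launch(kernel, grid_dim, block_dim, *args):
        # What a CUDA launch kernel<<<grid_dim, block_dim>>>(...) means, serially.
        for block in range(grid_dim):
            for thread in range(block_dim):
                kernel(block, thread, block_dim, *args)

    def vec_add(block_idx, thread_idx, block_dim, a, b, out):
        i = block_idx * block_dim + thread_idx   # global thread id
        if i < len(out):                         # CUDA-style bounds guard
            out[i] = a[i] + b[i]

    a, b = list(range(8)), list(range(8))
    out = [0] * 8
    launch(vec_add, 2, 4, a, b, out)
    print(out)  # [0, 2, 4, 6, 8, 10, 12, 14]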