## Posts

Oct, 29

### Hetero-Mark, A Benchmark Suite for CPU-GPU Collaborative Computing

Graphics Processing Units (GPUs) can easily outperform CPUs in processing large-scale data parallel workloads, but are considered weak in processing serialized tasks and communicating with other devices. Pursuing a CPU-GPU collaborative computing model which takes advantage of both devices could provide an important breakthrough in realizing the full performance potential of heterogeneous computing. In recent […]

Oct, 29

### GPflow: A Gaussian process library using TensorFlow

GPflow is a Gaussian process library that uses TensorFlow for its core computations and Python for its front end. The distinguishing features of GPflow are that it uses variational inference as the primary approximation method, provides concise code through the use of automatic differentiation, has been engineered with a particular emphasis on software testing and […]

Oct, 29

### GPU Performance Modeling and Optimization

The last decade has witnessed the blooming emergence of general-purpose Graphic-Processing-Unit computing (GPGPU). With the exponential growth of cores and threads in a modern GPU processor, how to analyze and optimize its performance becomes a grand challenge. In this thesis, as the modeling part, we propose an analytic model for throughput-oriented parallel processors. The model […]

Oct, 29

### GOTHIC: Gravitational oct-tree code accelerated by hierarchical time step controlling

The tree method is a widely implemented algorithm for collisionless N-body simulations in astrophysics well suited for GPU(s). Adopting hierarchical time stepping can accelerate N-body simulations; however, it is infrequently implemented and its potential remains untested in GPU implementations. We have developed a Gravitational Oct-Tree code accelerated by HIerarchical time step Controlling named GOTHIC, which […]

Oct, 29

### Accelerating BLAS on Custom Architecture through Algorithm-Architecture Co-design

Basic Linear Algebra Subprograms (BLAS) play key role in high performance and scientific computing applications. Experimentally, yesteryear multicore and General Purpose Graphics Processing Units (GPGPUs) are capable of achieving up to 15 to 57% of the peak performance at 65W to 240W of power respectively in underlying platform for compute bound operations like Double/Single Precision […]

Oct, 22

### Efficient Random Sampling – Parallel, Vectorized, Cache-Efficient, and Online

We consider the problem of sampling $n$ numbers from the range ${1,ldots,N}$ without replacement on modern architectures. The main result is a simple divide-and-conquer scheme that makes sequential algorithms more cache efficient and leads to a parallel algorithm running in expected time $mathcal{O}left(n/p+log pright)$ on $p$ processors. The amount of communication between the processors is […]

Oct, 22

### Sparse-Matrix support for the SkePU library for portable CPU/GPU programming

In this thesis work we have extended the SkePU framework by designing a new container data structure for the representation of generic two dimensional sparse matrices. Computation on matrices is an integral part of many scientific and engineering problems. Sometimes it is unnecessary to perform costly operations on zero entries of the matrix. If the […]

Oct, 22

### Energy-efficient FPGA Implementation of the k-Nearest Neighbors Algorithm Using OpenCL

Modern SoCs are getting increasingly heterogeneous with a combination of multi-core architectures and hardware accelerators to speed up the execution of compute-intensive tasks at considerably lower power consumption. Modern FPGAs, due to their reasonable execution speed and comparatively lower power consumption, are strong competitors to the traditional GPU based accelerators. High-level Synthesis (HLS) simplifies FPGA […]

Oct, 22

### CuMF_SGD: Fast and Scalable Matrix Factorization

Matrix factorization (MF) has been widely used in e.g., recommender systems, topic modeling and word embedding. Stochastic gradient descent (SGD) is popular in solving MF problems because it can deal with large data sets and is easy to do incremental learning. We observed that SGD for MF is memory bound. Meanwhile, single-node CPU systems with […]

Oct, 22

### OpenMP, OpenMP/MPI, and CUDA/MPI C programs for solving the time-dependent dipolar Gross-Pitaevskii equation

We present new versions of the previously published C and CUDA programs for solving the dipolar Gross-Pitaevskii equation in one, two, and three spatial dimensions, which calculate stationary and non-stationary solutions by propagation in imaginary or real time. Presented programs are improved and parallelized versions of previous programs, divided into three packages according to the […]

Oct, 15

### Efficient molecular dynamics simulations with many-body potentials on graphics processing units

Graphics processing units have been extensively used to accelerate classical molecular dynamics simulations. However, there is much less progress on the acceleration of force evaluations for many-body potentials compared to pairwise ones. In the conventional force evaluation algorithm for many-body potentials, the force, virial stress, and heat current for a given atom are accumulated within […]

Oct, 15

### Reordering strategy for blocking optimization in sparse linear solvers

Solving sparse linear systems is a problem that arises in many scientific applications, and sparse direct solvers are a time consuming and key kernel for those applications and for more advanced solvers such as hybrid direct-iterative solvers. For this reason, optimizing their performance on modern architectures is critical. The preprocessing steps of sparse direct solvers, […]