## Posts

Sep, 5

### Parallel Tree Traversal for Nearest Neighbor Query on the GPU

The similarity search problem arises in many application domains, including computer graphics, information retrieval, statistics, computational biology, and scientific data processing. Recently, several studies have accelerated k-nearest neighbor (kNN) queries using GPUs, but most of them develop brute-force exhaustive scanning algorithms leveraging a large […]

Sep, 5

### Optimizing Memory Efficiency for Deep Convolutional Neural Networks on GPUs

Leveraging large data sets, deep Convolutional Neural Networks (CNNs) achieve state-of-the-art recognition accuracy. Due to the substantial compute and memory operations, however, they require significant execution time. The massive parallel computing capability of GPUs makes them one of the ideal platforms to accelerate CNNs, and a number of GPU-based CNN libraries have been developed. […]

Sep, 5

### Implementation and Performance Analysis of Many-body Quantum Chemical Methods on the Intel Xeon Phi Coprocessor and NVIDIA GPU Accelerator

CCSD(T), part of the coupled cluster (CC) method, is one of the most accurate methods applicable to reasonably large molecules in the field of computational chemistry. An efficient parallel CCSD(T) implementation will have a significant impact on the application of high-accuracy methods. The Intel Xeon Phi Coprocessor and NVIDIA GPU are the most important coprocessors/accelerators which […]

Sep, 5

### Improving the Performance of the Contextual Spaces Re-Ranking Algorithm on Heterogeneous Systems

Re-ranking algorithms have been proposed to improve the effectiveness of Content-Based Image Retrieval (CBIR) systems by exploiting contextual information encoded in distance measures and ranked lists. In this paper, we show how we improved the efficiency of one of these algorithms, called Contextual Spaces Re-Ranking. We propose a modification to the algorithm that reduces its […]

Sep, 5

### A Survey of Techniques for Architecting Processor Components using Domain Wall Memory

Recent trends of increasing core-count and bandwidth/memory-wall have motivated researchers to explore novel memory technologies for designing processor components such as cache, register file, shared memory, etc. Domain wall memory (DWM), also known as racetrack memory, is a promising emerging technology due to its non-volatility and very high density. However, the use of DWM presents […]

Sep, 3

### Understanding the Impact of Hybrid Programming on Software Energy Efficiency

High performance computing systems today are heterogeneous in nature, with multiple CPUs and accelerators/coprocessors in each computing node. The majority of today’s programs utilize only a single computing component (e.g., a CPU, GPU, or Xeon Phi) while leaving other components idle (e.g., waiting for the results to be calculated). This may not be optimal for either […]

Sep, 3

### Matrix Computations and Optimization in Apache Spark

We describe matrix computations available in the cluster programming framework, Apache Spark. Out of the box, Spark provides abstractions and implementations for distributed matrices and optimization routines using these matrices. When translating single-node algorithms to run on a distributed cluster, we observe that often a simple idea is enough: separating matrix operations from vector operations […]

Sep, 3

### Ultra-Fast Detection of Higher-Order Epistatic Interactions on GPUs

Detecting higher-order epistatic interactions in Genome-Wide Association Studies (GWAS) remains a challenging task in the fields of genetic epidemiology and computer science. A number of algorithms have recently been proposed for epistasis discovery. However, they suffer from a high computational cost since statistical measures have to be evaluated for each possible combination of markers. Hence, […]

Sep, 3

### Fast 4D Sheared Filtering for Interactive Rendering of Distribution Effects

Soft shadows, depth of field, and diffuse global illumination are common distribution effects, usually rendered by Monte Carlo ray tracing. Physically correct, noise-free images can require hundreds or thousands of ray samples per pixel, and take a long time to compute. Recent approaches have exploited sparse sampling and filtering; the filtering is either fast (axis-aligned), […]

Sep, 3

### DeepPy: Pythonic deep learning

This technical report introduces DeepPy – a deep learning framework built on top of NumPy with GPU acceleration. DeepPy bridges the gap between high-performance neural networks and the ease of development from Python/NumPy. Users with a background in scientific computing in Python will quickly be able to understand and change the DeepPy codebase as it […]

Aug, 31

### Deep Learning on FPGAs

The recent successes of deep learning are largely attributed to the advancement of hardware acceleration technologies, which can accommodate the incredible growth of data sizes and model complexity. The current solution involves using clusters of graphics processing units (GPUs) to achieve performance beyond that of general-purpose processors (GPPs), but the use of field programmable […]

Aug, 31

### SafeGPU: Contract- and Library-Based GPGPU for Object-Oriented Languages

Using GPUs as general-purpose processors has revolutionized parallel computing by providing, for a large and growing set of algorithms, massive data-parallelization on desktop machines. An obstacle to their widespread adoption, however, is the difficulty of programming them and the low-level control of the hardware required to achieve good performance. This paper proposes a programming approach, […]