## Posts

Nov, 26

### Performance Modeling and Evaluation of Distributed Deep Learning Frameworks on GPUs

Deep learning frameworks have been widely deployed on GPU servers for deep learning applications in both academia and industry. In the training of deep neural networks (DNNs), there are many standard processes or algorithms, such as convolution and stochastic gradient descent (SGD), but the running performance of different frameworks might be different even running the […]

Nov, 26

### High Performance Streaming Smith-Waterman Implementation with Implicit Synchronization on Intel FPGA using OpenCL

The Smith-Waterman algorithm is widely used in bioinformatics and is often used as a benchmark of FPGA performance. Here we present our highly optimized SmithWaterman implementation on Intel FPGAs using OpenCL. Our implementation is both faster and more efficient than other current Smith-Waterman implementations, obtaining a theoretical performance of 214 GCUPS. Moreover, due to the […]

Nov, 26

### Computing the distance between two finite element solutions defined on different 3D meshes on a GPU

This article introduces a new method to efficiently compute the distance (i.e., L^p norm of the difference) between two functions supported by two different meshes of the same 3D domain. The functions that we consider are typically finite element solutions discretized in different function spaces supported by meshes that are potentially completely unrelated. Our method […]

Nov, 26

### Efficient Target and Application Specific Selection and Ordering of Compiler Passes

Programmers usually rely on one from a set of optimizing compiler optimization level flags shipped with the compiler they are using to compile their source code. Those compiler flags represent fixed compiler pass sequences, and therefore in some situations better performance and/or other metrics such as code size can be achieved if using compiler sequences […]

Nov, 26

### GPU Pro 7: Advanced Rendering

The latest edition of this bestselling game development reference offers proven tips and techniques for the real-time rendering of special effects and visualization data that are useful for beginners and seasoned game and graphics programmers alike. Exploring recent developments in the rapidly evolving field of real-time rendering, GPU Pro 7: Advanced Rendering Techniques assembles a […]

Nov, 21

### A survey on graphic processing unit computing for large-scale data mining

General purpose computation using Graphic Processing Units (GPUs) is a well-established research area focusing on high-performance computing solutions for massively parallelizable and time-consuming problems. Classical methodologies in machine learning and data mining cannot handle processing of massive and high-speed volumes of information in the context of the big data era. GPUs have successfully improved the […]

Nov, 21

### Bitmap Filter: Speeding up Exact Set Similarity Joins with Bitwise Operations

The Exact Set Similarity Join problem aims to find all similar sets between two collections of sets, with respect to a threshold and a similarity function such as overlap, Jaccard, dice or cosine. The naive approach verifies all pairs of sets and it is often considered impractical due the high number of combinations. So, Exact […]

Nov, 21

### GPU Parallelization for Unstructured Sparse Matrix Problems with OpenMP 4.5 and OpenACC

The effective use of parallelized hardware is an important goal of today’s computer developments. Nvidia GPUs are an important footing in this context. While CUDA implemented algorithms focus on detailed optimized usage of GPU elements the pragma directive parallelization targets GPU computation for a broader community. In this paper we focus on the implementation of […]

Nov, 21

### Compiling and Optimizing OpenMP 4.X Programs to OpenCL and SPIR

Given their massively parallel computing capabilities heterogeneous architectures comprised of CPUs and accelerators have been increasingly used to speed-up scientific and engineering applications. Nevertheless, programming such architectures is a challenging task for most non-expert programmers as typical accelerator programming languages (e.g. CUDA and OpenCL) demand a thoroughly understanding of the underlying hardware to enable an […]

Nov, 21

### Unified Deep Learning with CPU, GPU, and FPGA Technologies

Deep learning and complex machine learning has quickly become one of the most important computationally intensive applications for a wide variety of fields. The combination of large data sets, high-performance computational capabilities, and evolving and improving algorithms has enabled many successful applications which were previously difficult or impossible to consider. This paper explores the challenges […]

Nov, 16

### Hydra: a C++11 framework for data analysis in massively parallel platforms

Hydra is a header-only, templated and C++11-compliant framework designed to perform the typical bottleneck calculations found in common HEP data analyses on massively parallel platforms. The framework is implemented on top of the C++11 Standard Library and a variadic version of the Thrust library and is designed to run on Linux systems, using OpenMP, CUDA […]

Nov, 16

### Launch-time Optimization of OpenCL Kernels

OpenCL kernels are compiled first before kernel arguments and launch geometry are provided later at launch time. Although some of these values remain constant during execution, the compiler is unable to optimize for them since it has no access to them. We propose and implement a novel approach that identifies such arguments, geometry, and optimizations […]