Posts
Jun, 17
Parallel Monte Carlo on Intel MIC Architecture
The trade-off between the cost-efficiency of powerful computational accelerators and the increasing energy needed to perform numerical tasks can be tackled by implementing algorithms on the Intel Many Integrated Core (MIC) architecture. Achieving the best performance of these algorithms requires appropriate optimization and parallelization approaches throughout the entire design process. Monte Carlo […]
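As a loose illustration of this kind of workload (not the paper's code, and using Python worker processes rather than MIC threads), the sketch below estimates pi by splitting independent Monte Carlo trials across parallel workers; the function and parameter names are made up for this example.

```python
# Hypothetical sketch: task-parallel Monte Carlo estimation of pi,
# with independent trial batches farmed out to worker processes.
import random
from multiprocessing import Pool

def count_hits(n_trials: int) -> int:
    """Count random points falling inside the unit quarter-circle."""
    rng = random.Random()
    hits = 0
    for _ in range(n_trials):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

if __name__ == "__main__":
    trials_per_worker = 1_000_000
    n_workers = 4
    with Pool(n_workers) as pool:
        hits = sum(pool.map(count_hits, [trials_per_worker] * n_workers))
    print(4.0 * hits / (trials_per_worker * n_workers))
```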
Jun, 17
Parallel Computing of Particle Trajectory Sonification to Enable Real-Time Interactivity
In this paper, we revisit, explore and extend the Particle Trajectory Sonification (PTS) model, which supports cluster analysis of high-dimensional data by probing a model space with virtual particles that are "gravitationally" attracted to a mode of the dataset’s potential function. The particles’ kinetic energy progression as a function of time adds directly to a […]
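A minimal sketch of the idea, under the assumption of a toy 2-D dataset and a sum-of-Gaussians potential (not the authors' implementation): particles accelerate toward dense regions of the data, and their per-step kinetic energy is the trace the PTS model would turn into sound.

```python
# Illustrative particle simulation: gravity-like attraction toward data points,
# with the kinetic-energy history recorded at every step.
import numpy as np

def simulate(data, particles, steps=500, dt=0.01, mass=1.0, sigma=0.5):
    """Move particles in the dataset's Gaussian potential; return kinetic energy per step."""
    pos = particles.copy()
    vel = np.zeros_like(pos)
    energy = []
    for _ in range(steps):
        # force = negative gradient of a sum-of-Gaussians potential
        diff = data[None, :, :] - pos[:, None, :]               # (P, N, 2)
        w = np.exp(-np.sum(diff**2, axis=2) / (2 * sigma**2))   # (P, N)
        force = np.sum(diff * w[:, :, None], axis=1) / sigma**2
        vel += dt * force / mass
        pos += dt * vel
        energy.append(0.5 * mass * np.sum(vel**2, axis=1))      # per-particle KE
    return np.array(energy)

data = np.random.randn(200, 2)
particles = np.random.uniform(-3, 3, size=(16, 2))
ke = simulate(data, particles)
print(ke.shape)  # (steps, particles)
```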
Jun, 10
Smith-Waterman Acceleration in Multi-GPUs: A Performance per Watt Analysis
We present a performance per watt analysis of CUDAlign 4.0, a parallel strategy to obtain the optimal alignment of huge DNA sequences in multi-GPU platforms using the exact Smith-Waterman method. Speed-up factors and energy consumption are monitored on different stages of the algorithm with the goal of identifying advantageous scenarios to maximize acceleration […]
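For context, this is the Smith-Waterman recurrence that CUDAlign parallelizes, shown here as a plain sequential Python sketch with illustrative scoring values (the paper's multi-GPU strategy and parameters are not reproduced here).

```python
# Sequential Smith-Waterman local alignment score (toy scoring scheme).
def smith_waterman(a: str, b: str, match=2, mismatch=-1, gap=-2) -> int:
    """Return the optimal local-alignment score of sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = max(0,
                          h[i - 1][j - 1] + s,   # match / mismatch
                          h[i - 1][j] + gap,     # gap in b
                          h[i][j - 1] + gap)     # gap in a
            best = max(best, h[i][j])
    return best

print(smith_waterman("GATTACA", "GCATGCU"))
```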
Jun, 10
Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
Deep learning thrives with large neural networks and large datasets. However, larger networks and larger datasets result in longer training times that impede research and development progress. Distributed synchronous SGD offers a potential solution to this problem by dividing SGD minibatches over a pool of parallel workers. Yet to make this scheme efficient, the per-worker […]
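A toy sketch of the underlying scheme (not the paper's training recipe): each "worker" computes a gradient on its shard of the minibatch, and the averaged gradient drives a single synchronous update. The linear model and hyperparameters below are assumptions for illustration only.

```python
# Synchronous data-parallel SGD on a toy least-squares problem:
# per-worker gradients are averaged (all-reduce style) before one update.
import numpy as np

def grad_mse(w, x, y):
    """Gradient of mean squared error for the linear model y ~ x @ w."""
    return 2.0 * x.T @ (x @ w - y) / len(y)

rng = np.random.default_rng(0)
x = rng.normal(size=(1024, 10))
true_w = rng.normal(size=10)
y = x @ true_w

w = np.zeros(10)
n_workers, lr = 8, 0.1
for step in range(200):
    shards_x = np.array_split(x, n_workers)
    shards_y = np.array_split(y, n_workers)
    grads = [grad_mse(w, sx, sy) for sx, sy in zip(shards_x, shards_y)]
    w -= lr * np.mean(grads, axis=0)   # one synchronous update per step

print(np.allclose(w, true_w, atol=1e-3))
```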
Jun, 10
Crane – Fast and Migratable GPU Passthrough for OpenCL applications
General purpose GPU (GPGPU) computing in virtualized environments leverages PCI passthrough to achieve GPU performance comparable to bare-metal execution. However, GPU passthrough prevents service administrators from performing virtual machine migration between physical hosts. Crane is a new technique for virtualizing OpenCL-based GPGPU computing that achieves within 5.25% of passthrough GPU performance while supporting VM migration. […]
Jun, 10
CELES: CUDA-accelerated simulation of electromagnetic scattering by large ensembles of spheres
CELES is a freely available MATLAB toolbox to simulate light scattering by many spherical particles. Aiming at high computational performance, CELES leverages block-diagonal preconditioning, a lookup-table approach to evaluate costly functions and massively parallel execution on NVIDIA graphics processing units using the CUDA computing platform. The combination of these techniques makes it possible to efficiently address large […]
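As a rough illustration of one of these ingredients (an assumption for this listing, not CELES code), the sketch below applies a block-diagonal preconditioner by solving only the diagonal blocks of a toy matrix, which is the kind of cheap approximate inverse used to speed up an iterative solver.

```python
# Block-diagonal preconditioning: apply the inverse of A's diagonal blocks to a vector.
import numpy as np

def block_diag_apply_inverse(a, rhs, block_size):
    """Solve each diagonal block of A against the matching slice of rhs."""
    n = a.shape[0]
    out = np.empty_like(rhs)
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        out[start:stop] = np.linalg.solve(a[start:stop, start:stop], rhs[start:stop])
    return out

rng = np.random.default_rng(1)
a = rng.normal(size=(12, 12)) + 12 * np.eye(12)   # diagonally dominant toy matrix
b = rng.normal(size=12)
print(block_diag_apply_inverse(a, b, block_size=4).shape)
```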
Jun, 10
MobiRNN: Efficient Recurrent Neural Network Execution on Mobile GPU
In this paper, we explore optimizations to run Recurrent Neural Network (RNN) models locally on mobile devices. RNN models are widely used for Natural Language Processing, Machine Translation, and other tasks. However, existing mobile applications that use RNN models do so in the cloud. To address privacy and efficiency concerns, we show how RNN models […]
Jun, 5
Neneta: Heterogeneous Computing Complex-Valued Neural Network Framework
Due to the increased demand for computational efficiency in the training, validation and testing of artificial neural networks, many open source software frameworks have emerged. The GPU programming model of choice in such frameworks is almost exclusively CUDA. Also symptomatic is the lack of support for complex-valued neural networks. With our research going exactly in that […]
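To make the "complex-valued" part concrete, here is a hypothetical sketch, not the Neneta API, of a single complex-valued dense layer with a split-type activation applied to the real and imaginary parts separately.

```python
# Toy complex-valued dense layer: complex affine map + split tanh activation.
import numpy as np

def complex_dense(x, w, b):
    """Affine map with complex weights, then tanh on real and imaginary parts."""
    z = x @ w + b
    return np.tanh(z.real) + 1j * np.tanh(z.imag)

rng = np.random.default_rng(2)
x = rng.normal(size=(4, 8)) + 1j * rng.normal(size=(4, 8))   # batch of complex inputs
w = rng.normal(size=(8, 3)) + 1j * rng.normal(size=(8, 3))
b = np.zeros(3, dtype=complex)
print(complex_dense(x, w, b).shape)   # (4, 3)
```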
Jun, 5
Speedup and Parallelization Models for Energy-Efficient Many-Core Systems Using Performance Counters
Traditional speedup models, such as Amdahl’s, facilitate the study of the impact of running parallel workloads on manycore systems. However, these models are typically based on software characteristics, assuming ideal hardware behaviors. As such, the applicability of these models for energy and/or performance-driven system optimization is limited by two factors. Firstly, speedup cannot be measured […]
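For reference, the traditional Amdahl model the abstract alludes to can be evaluated in a few lines; the parallel fraction chosen here is just an example value.

```python
# Amdahl's law: speedup on n cores with parallel fraction p.
def amdahl_speedup(p: float, n: int) -> float:
    return 1.0 / ((1.0 - p) + p / n)

for cores in (2, 8, 64, 1024):
    print(cores, round(amdahl_speedup(0.95, cores), 2))
# Even with 95% parallel code, the speedup saturates near 1 / 0.05 = 20x.
```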
Jun, 5
Program Acceleration in a Heterogeneous Computing Environment Using OpenCL, FPGA, and CPU
Reaching the so-called "performance wall" in 2004 inspired innovative approaches to performance improvement. Parallel programming, distributed computing, and System on a Chip (SOC) design drove change. Hardware acceleration in mainstream computing systems brought significant improvement in the performance of applications targeted directly to a specific hardware platform. Targeting a single hardware platform, however, typically requires […]
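As a small example of what "targeting a platform" looks like from host code (assuming the pyopencl package and an installed OpenCL runtime; not code from the paper), the snippet lists the CPU, GPU, or FPGA devices an OpenCL application could dispatch kernels to.

```python
# Enumerate the OpenCL platforms and devices visible to the host.
import pyopencl as cl

for platform in cl.get_platforms():
    for device in platform.get_devices():
        print(f"{platform.name}: {device.name}")
```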
Jun, 5
UT-OCL: An OpenCL Framework for Embedded Systems Using Xilinx FPGAs
The number of heterogeneous components on a System-on-Chip (SoC) has continued to increase. Software developers leverage these heterogeneous systems by using high-level languages to enable the execution of applications. For the application to execute correctly, hardware support for the features and constructs of the programming model needs to be incorporated into the system. OpenCL is a […]
Jun, 5
A Diversified Multi-Start Algorithm for Unconstrained Binary Quadratic Problems Leveraging the Graphics Processor Unit
Multi-start algorithms are a common and effective tool for metaheuristic searches. In this paper we amplify multi-start capabilities by employing the parallel processing power of the graphics processing unit (GPU) to quickly generate a diverse starting set of solutions for the Unconstrained Binary Quadratic Optimization Problem, which are evaluated and used to implement screening methods […]
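An illustrative sketch of the evaluation step being offloaded (not the paper's GPU code): computing the UBQP objective x^T Q x for a batch of random binary starting solutions and keeping the most promising one. The problem sizes and the einsum-based helper are assumptions for this example.

```python
# Batch evaluation of the UBQP objective for many random binary starts.
import numpy as np

def ubqp_values(q, xs):
    """Objective values x^T Q x for each binary row vector x in xs."""
    return np.einsum("ij,jk,ik->i", xs, q, xs)

rng = np.random.default_rng(3)
n, n_starts = 50, 1000
q = rng.integers(-10, 11, size=(n, n)).astype(float)
q = (q + q.T) / 2                        # symmetric Q
xs = rng.integers(0, 2, size=(n_starts, n)).astype(float)
values = ubqp_values(q, xs)
best_start = xs[np.argmax(values)]       # seed for the subsequent search
print(values.max())
```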