Posts
Jun, 7
The implementation and optimization of Bitonic sort algorithm based on CUDA
This paper describes the bitonic sort algorithm in detail and implements it on the CUDA architecture. We also apply two optimizations to the implementation details that exploit the characteristics of the GPU and greatly improve efficiency. Finally, we survey the optimized bitonic sort algorithm on the GPU with the speedup of […]
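For readers unfamiliar with the algorithm, the following is a minimal Python sketch of the bitonic sorting network, not the paper's CUDA implementation or its optimizations. The function name `bitonic_sort` is hypothetical, and the sketch assumes the input length is a power of two. The inner loop over `i` is the part a GPU version would typically hand to one thread per element.

```python
def bitonic_sort(values):
    """Return an ascending copy of `values`; the length must be a power of two."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a non-zero power of two")
    a = list(values)
    k = 2
    while k <= n:              # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:          # distance between compared elements
            for i in range(n):  # independent compare-exchanges (one thread each on a GPU)
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    # Swap when the pair is out of order for this block's direction.
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a
```

Because every compare-exchange in one pass over `i` touches a disjoint pair, the passes map naturally onto data-parallel hardware, which is why the algorithm is a common GPU sorting example.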
Jun, 7
A Parallel Implementation of the Galerkin Method for Solving Partial Differential Equations on a Triangular Mesh
Finite Element Methods are techniques for estimating solutions to boundary value problems for partial differential equations from an approximating subspace. These methods are based on weak or variational forms of the BVP that require less of the problem functions than what the original PDE would suggest in terms of order of differentiability and continuity. In […]
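As an illustration of why the weak form demands less smoothness, consider the Poisson problem with homogeneous Dirichlet conditions. This model problem is an assumption chosen for illustration; the abstract does not name the PDE it solves. The strong form needs two derivatives of u, while the weak form needs only one, and the Galerkin method restricts the weak form to a finite-dimensional subspace spanned by basis functions \varphi_j (piecewise-linear functions on a triangular mesh, for example).

```latex
% Strong form (requires second derivatives of u):
-\Delta u = f \quad \text{in } \Omega, \qquad u = 0 \quad \text{on } \partial\Omega

% Weak form (requires only first derivatives): find u \in H_0^1(\Omega) such that
\int_\Omega \nabla u \cdot \nabla v \, dx = \int_\Omega f \, v \, dx
\qquad \text{for all } v \in H_0^1(\Omega)

% Galerkin approximation: u_h = \sum_{j=1}^{N} c_j \varphi_j gives the linear system
\sum_{j=1}^{N} K_{ij} \, c_j = b_i, \qquad
K_{ij} = \int_\Omega \nabla \varphi_j \cdot \nabla \varphi_i \, dx, \qquad
b_i = \int_\Omega f \, \varphi_i \, dx
```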
Jun, 5
Machine Learning Based Auto-tuning for Enhanced OpenCL Performance Portability
Heterogeneous computing, which combines devices with different architectures, is rising in popularity and promises increased performance combined with reduced energy consumption. OpenCL has been proposed as a standard for programming such systems and offers functional portability. It does, however, suffer from poor performance portability: code tuned for one device must be re-tuned to achieve good […]
Jun, 5
Accelerated Nodal Discontinuous Galerkin Simulations for Reverse Time Migration with Large Clusters
Improving both accuracy and computational performance of numerical tools is a major challenge for seismic imaging and generally requires specialized implementations to make full use of modern parallel architectures. We present a computational strategy for reverse-time migration (RTM) with accelerator-aided clusters. A new imaging condition computed from the pressure and velocity fields is introduced. The […]
Jun, 5
Blocks and Fuel: Frameworks for deep learning
We introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano’s symbolic computational graph, and providing an extensive set of utilities to […]
Jun, 5
Tackling Exascale Software Challenges in Molecular Dynamics Simulations with GROMACS
GROMACS is a widely used package for biomolecular simulation, and over the last two decades it has evolved from small-scale efficiency to advanced heterogeneous acceleration and multi-level parallelism targeting some of the largest supercomputers in the world. Here, we describe some of the ways we have been able to realize this through the use of […]
Jun, 5
Fast algorithms and efficient GPU implementations for the Radon transform and the back-projection operator represented as convolution operators
The Radon transform and its adjoint, the back-projection operator, can both be expressed as convolutions in log-polar coordinates. Hence, fast algorithms for the application of the operators can be constructed by using FFT, if data is resampled at log-polar coordinates. Radon data is typically measured on an equally spaced grid in polar coordinates, and reconstructions […]
Jun, 5
7th International Conference on Signal Processing Systems (ICSPS), 2015
Topics: Adaptive Filtering & Signal Processing; Ad-Hoc and Sensor Networks; Analog and Mixed Signal Processing; Array Signal Processing; Audio and Electroacoustics; Audio/Speech Processing and Coding; Bioimaging and Signal Processing; Biometrics & Authentication; Biosignal Processing & Understanding; Communication and Broadband Networks; Communication Signal Processing; Computer Vision & Virtual Reality; Cryptography and Network Security; Design and Implementation […]
Jun, 3
A Survey of Software Techniques for Using Non-Volatile Memories for Storage and Main Memory Systems
Non-volatile memory (NVM) devices, such as Flash, phase change RAM, spin transfer torque RAM, and resistive RAM, offer several advantages and challenges when compared to conventional memory technologies, such as DRAM and magnetic hard disk drives (HDDs). In this paper, we present a survey of software techniques that have been proposed to exploit the advantages […]
Jun, 1
Genetically Improved BarraCUDA
BarraCUDA is a C program which uses the BWA algorithm in parallel with NVIDIA CUDA to align short next-generation DNA sequences against a reference genome. The genetically improved (GI) code is up to three times faster on short paired-end reads from The 1000 Genomes Project and 60 percent more accurate on a short […]
Jun, 1
Research on the fast Fourier transform of image based on GPU
The study of general-purpose computation on GPUs (Graphics Processing Units) can improve the image processing capability of microcomputer systems. This paper studies the parallelism of the different stages of the decimation-in-time radix-2 FFT algorithm, designs the butterfly and scramble kernels, and implements a 2D FFT on the GPU. The experimental results demonstrate the validity and […]
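The two stages named in the abstract can be sketched on the CPU as follows. This is a plain Python illustration under stated assumptions, not the paper's GPU kernels: the function names `bit_reverse_permute`, `fft_radix2`, and `fft2d` are hypothetical, "scramble" is taken to mean the bit-reversal reordering, and all lengths are assumed to be powers of two.

```python
import cmath

def bit_reverse_permute(x):
    """'Scramble' stage: move element i to the index with its bits reversed."""
    n = len(x)
    bits = n.bit_length() - 1
    out = [0j] * n
    for i in range(n):
        r = int(format(i, "0{}b".format(bits))[::-1], 2) if bits else 0
        out[r] = x[i]
    return out

def fft_radix2(x):
    """Decimation-in-time radix-2 FFT of a power-of-two-length sequence."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a non-zero power of two")
    a = bit_reverse_permute([complex(v) for v in x])
    size = 2
    while size <= n:                      # one butterfly stage per doubling
        half = size // 2
        w_step = cmath.exp(-2j * cmath.pi / size)
        for start in range(0, n, size):
            w = 1 + 0j
            for k in range(half):         # butterflies within a stage are independent
                u = a[start + k]
                t = w * a[start + k + half]
                a[start + k] = u + t
                a[start + k + half] = u - t
                w *= w_step
        size *= 2
    return a

def fft2d(image):
    """2D FFT as 1D FFTs over every row, then over every column."""
    rows = [fft_radix2(row) for row in image]
    cols = [fft_radix2(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]
```

Within a single stage no butterfly reads a value that another butterfly in that stage writes, so each stage is a natural unit of parallel work; the row and column passes of the 2D transform are independent in the same way.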
Jun, 1
A Parallel Cellular Automaton Simulation Framework using CUDA
In the current digital age, the use of cellular automata to simulate natural systems has grown more popular as our understanding of cellular systems increases. Up until about a decade ago, digital models based on the concept of cellular automata were primarily simulated with sequential rule application algorithms, which do not exploit the inherent […]
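To make the data-parallel structure of a cellular automaton concrete, here is a minimal Python sketch of one synchronous update step. Conway's Game of Life with wrap-around edges is used purely as an assumed example rule; the abstract does not say which automata the framework targets, and the name `life_step` is hypothetical.

```python
def life_step(grid):
    """Return the next generation of a 0/1 grid under Conway's rules (toroidal edges)."""
    rows, cols = len(grid), len(grid[0])

    def live_neighbours(r, c):
        return sum(
            grid[(r + dr) % rows][(c + dc) % cols]
            for dr in (-1, 0, 1)
            for dc in (-1, 0, 1)
            if (dr, dc) != (0, 0)
        )

    # Write into a fresh grid: every cell reads only the old state, so all
    # cells could be updated at the same time (one GPU thread per cell).
    new = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            n = live_neighbours(r, c)
            new[r][c] = 1 if n == 3 or (grid[r][c] == 1 and n == 2) else 0
    return new
```

The separate output grid is the key design choice: because no cell's new value depends on another cell's new value, the two nested loops carry no ordering constraint and can be replaced by a parallel launch.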