The most recent entries
Parallel implementation of the wideband DOA algorithm on single core, multicore, GPU and IBM cell BE processor
The Multiple Signal Classification (MUSIC) algorithm is a powerful technique for determining the Direction of Arrival (DOA) of signals impinging on an antenna array. The algorithm is serial, mathematically intensive, and requires substantial computing power to run in real time. Multi-core processors have recently become more prevalent and affordable, but adapting existing serial algorithms into parallel algorithms suited to today's multi-core processors is a daunting challenge. The DOA algorithm has been implemented on a multicore CPU (Intel Nehalem quad core), NVIDIA's GPU...
CONTEXT. The cryptographically secure pseudo-random number generator Blum Blum Shub (BBS) is a simple algorithm with a strong security proof, however it requires very large numbers to be secure, which makes it computationally heavy. The Graphics Processing Unit (GPU) is a common vector processor originally dedicated to computer-game graphics, but has since been adapted to perform general-purpose computing. The GPU has a large potential for fast general-purpose parallel computing but due to its architecture it is difficult to adapt certain algorithms to utilise the full computational power of...
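The BBS recurrence itself is simple: repeated squaring modulo a Blum integer M = p·q, emitting low-order bits. A minimal Python sketch with toy parameters (a secure instance would use primes of thousands of bits, which is exactly what makes the generator computationally heavy; p, q, and the seed below are illustrative only):

```python
# Minimal sketch of the Blum Blum Shub recurrence x -> x^2 mod M.
# Toy parameters only: real deployments use primes of thousands of bits.
p, q = 11, 23          # both primes congruent to 3 mod 4
M = p * q              # Blum integer modulus

def bbs_bits(x, n, M):
    """Yield n pseudo-random bits; x must be coprime to M."""
    bits = []
    for _ in range(n):
        x = (x * x) % M
        bits.append(x & 1)   # emit the least significant bit
    return bits, x

bits, _ = bbs_bits(9, 8, M)  # seed 9 = 3^2 mod M
```

The large-number modular squaring is the part that maps naturally onto a vector processor such as a GPU, since multi-precision multiplication decomposes into many independent limb products.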
The Graphics Processing Unit (GPU) has evolved from a single-purpose rendering device into a general-purpose parallel processor, and it far outperforms the CPU in many fields of science. String matching is widely used, especially in information retrieval, intrusion detection, and computational biology. In this paper, we designed and implemented a GPU-based multi-string matching algorithm, called G-WM, by improving the traditional serial WM algorithm; it achieves 12 and 11.2 times the performance of serial WM on equal- and unequal-length pattern sets, respectively.
High-performance computing systems today include a variety of compute devices such as multi-core CPUs, GPUs and many-core accelerators. OpenCL allows programming different types of compute devices using a single API and kernel language. However, there is no standard matrix operations library in OpenCL for operations such as matrix multiplication that works well on a variety of hardware from multiple vendors. We implemented an OpenCL auto-tuning library for real and complex variants of general matrix multiply (GEMM) and present detailed performance results and analysis on a variety of GPU and...
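For reference, the GEMM operation such a library tunes is defined by BLAS as C ← αAB + βC. A naive Python loop nest fixing the semantics (tuned OpenCL kernels implement the same contraction with tiling and vectorization; this reference version is for clarity, not performance):

```python
# Naive reference for BLAS GEMM: C <- alpha*A*B + beta*C.
# A is n x k, B is k x m, C is n x m; lists of lists stand in for
# the device buffers an OpenCL kernel would operate on.
def gemm(alpha, A, B, beta, C):
    n, k = len(A), len(A[0])
    m = len(B[0])
    for i in range(n):
        for j in range(m):
            acc = sum(A[i][p] * B[p][j] for p in range(k))
            C[i][j] = alpha * acc + beta * C[i][j]
    return C
```

An auto-tuner keeps these semantics fixed while searching over kernel parameters (tile sizes, work-group shapes, vector widths) for each target device.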
An algorithm is proposed for tracking objects in real time, based on a neural network implemented on a GPU. The algorithm's parameters are investigated and optimized. Tracking is accelerated by 10 times and training by 2 times compared with the sequential version of the algorithm. The maximum frame resolution for real-time tracking and the optimum frame sampling rate from a movie are calculated.
Swarm intelligence algorithms have been widely used to solve difficult real-world problems in both academic and engineering domains. Thanks to their inherent parallelism, various parallelized swarm intelligence algorithms have been proposed to speed up the optimization process, especially on massively parallel GPU architectures. However, conventional swarm intelligence algorithms are usually not designed specifically for the GPU architecture: they either cannot fully exploit the tremendous computational power of GPUs or do not scale effectively as problem sizes grow....
Computing highly-accurate approximate solutions to partial differential equations (PDEs) requires both a robust numerical method and a powerful machine. We present a parallel implementation of the discontinuous Galerkin (DG) method on graphics processing units (GPUs). In addition to being flexible and highly accurate, DG methods accommodate parallel architectures well, as their discontinuous nature produces entirely element-local approximations. While GPUs were originally intended to compute and display computer graphics, they have recently become a popular general purpose computing device....
Modern radio telescopes, such as the Low Frequency Array (LOFAR) in the north of the Netherlands, process the signal from the sky in software rather than in expensive special-purpose hardware. This gives astronomers unprecedented flexibility to perform a wide variety of scientific experiments. However, designing software that delivers optimal performance for many different experiments, possibly running on different hardware, is a challenging task. Since optimizing the software by hand to fit the various experiments and hardware is infeasible, we employ a technique...
Three-dimensional simulations of buoyancy-driven flow of two immiscible liquids are performed using the lattice Boltzmann method (LBM) implemented on a graphics processing unit (GPU). The GPU is a new paradigm for computing fluid flows that has become popular in recent years; it is powerful and convenient to use. LBM, an excellent alternative technique for fluid flow simulation, gives a very high computational speed-up when implemented on GPUs. Our GPU-based LBM solver is 25 times faster than the corresponding CPU-based code.
Matrix multiplication is a commonly used mathematical operation with many practical applications; it is used to solve problems in a wide variety of fields, including science, engineering, and computer science. Given two matrices, A and B, the operation produces a resultant matrix C. The concept of density describes the number of nonzero elements in a matrix relative to the total number of elements: for an NxM matrix with Z nonzero elements, the density is Z/(NxM). A sparse matrix is one with a low density. Sparse matrices can be stored in special formats to eliminate the...
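The density definition is easy to check on a toy matrix; the 3x4 example and the 0.1 "sparse" threshold below are arbitrary illustrations, not values from the paper:

```python
# Density = Z / (N*M): nonzero entries over total entries.
# Toy 3x4 matrix with Z = 2 nonzeros, so density = 2/12.
A = [
    [0, 0, 5, 0],
    [0, 0, 0, 0],
    [7, 0, 0, 0],
]
Z = sum(1 for row in A for v in row if v != 0)
N, M = len(A), len(A[0])
density = Z / (N * M)
is_sparse = density < 0.1   # illustrative threshold only
```

Formats such as CSR exploit exactly this: they store only the Z nonzeros plus index arrays, instead of all N*M entries.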
With the rise of cloud computing infrastructures on one side and the increased accessibility of parallel computing devices such as GPUs and multi-core CPUs on the other, parallel programming has recently gained renewed interest. This is particularly true in the domain of video coding, where the complexity and running time of the algorithms tend to limit access to the core technology. In this work, we focus on the motion estimation problem, well known to be the most time-consuming step of most video coding techniques. By relying on the OpenCL standard, which...
Clusters of heterogeneous nodes composed of multi-core CPUs and GPUs are increasingly being used for High Performance Computing (HPC) due to their benefits in peak performance and energy efficiency. In order to fully harness the computational capabilities of such architectures, application developers often employ a combination of different parallel programming paradigms (e.g. OpenCL, CUDA, MPI and OpenMP), also known in the literature as hybrid programming, which makes application development very challenging. Furthermore, these languages offer limited support to orchestrate data and computations...
Most viewed papers (last 30 days)
- Graphics Programming on the Web WebCL Course Notes
- Use NVIDIA CUDA technology to create genetic algorithms with extensive population
- Simulating the universe with GPU-accelerated supercomputers: n-body methods, tests, and examples
- Implementations of the FFT algorithm on GPU
- GPU Scripting and Code Generation with PyCUDA
- Secrets from the GPU
- A General-Purpose GPU Reservoir Computer
- One OpenCL to Rule Them All?
- Fluid Motion Modelling Using Vortex Particle Method on GPU
- Adding GPU Computing to Computer Organization Courses
Adaptive Dynamic Load Balancing in Heterogeneous Multiple GPUs-CPUs Distributed Setting: Case Study of B&B Tree Search
Graphics Programming on the Web WebCL Course Notes
Automatic Compilation for Heterogeneous Architectures with Single Assignment C
A parallel decoding algorithm of LDPC codes using CUDA
Mr. Scan: Extreme Scale Density-Based Clustering using a Tree-Based Network of GPGPU Nodes
Optimizing MapReduce for GPUs with effective shared memory usage
Comprehensive Analysis of High-Performance Computing Methods for Filtered Back-Projection
Kernelet: High-Throughput GPU Kernel Executions with Dynamic Slicing and Scheduling
CUDA implementation of the algorithm for simulating the epidemic spreading over large networks
Stencil-Aware GPU Optimization of Iterative Solvers
Registered users can now run their OpenCL applications at hgpu.org. We provide 1 minute of computer time per run on two nodes: one with two AMD GPUs, and one with an AMD and an nVidia GPU. There are no restrictions on the number of runs.
The platforms are:

Node 1:
- GPU device 0: AMD/ATI Radeon HD 5870 2GB, 850MHz
- GPU device 1: AMD/ATI Radeon HD 6970 2GB, 880MHz
- CPU: AMD Phenom II X6 1055T @ 2.8GHz
- RAM: 12GB
- HDD: 2TB, RAID-0
- OS: OpenSUSE 11.4
- SDK: AMD APP SDK 2.8

Node 2:
- GPU device 0: AMD/ATI Radeon HD 7970 3GB, 1000MHz
- GPU device 1: nVidia GeForce GTX 560 Ti 2GB, 822MHz
- CPU: Intel Core i7-2600 @ 3.4GHz
- RAM: 16GB
- HDD: 2TB, RAID-0
- OS: OpenSUSE 12.2
- SDK: nVidia CUDA Toolkit 5.0.35, AMD APP SDK 2.8
A completed OpenCL project should be uploaded via the User dashboard (see the instructions and example there); compilation and execution terminal output logs will be provided to the user.