The most recent entries
Code-message coevolution (CMC) models represent coevolution of a genetic code and a population of protein-coding genes ("messages"). Formally, CMC models are sets of quasispecies coupled together for fitness through a shared genetic code. Although CMC models display plausible explanations for the origin of multiple genetic code traits by natural selection, useful modern implementations of CMC models are not currently available. To meet this need we present CMCpy, an object-oriented Python API and command-line executable front-end that can reproduce all published results of CMC...
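The core object a CMC model couples together is the quasispecies: a population whose replication-mutation dynamics follow x' ∝ Q·(f∘x). As a minimal illustration of one such update (the fitness values and mutation matrix below are arbitrary choices for this sketch, not CMCpy's actual parameters):

```python
# Minimal quasispecies update: x' is proportional to Q * (f * x), the
# replication-mutation dynamics that CMC models couple through a shared
# genetic code. Fitness values and mutation rates here are illustrative.

def quasispecies_step(x, f, Q):
    """One generation: replicate by fitness, mutate, renormalize."""
    # Unnormalized count of type i: sum_j Q[i][j] * f[j] * x[j]
    y = [sum(Q[i][j] * f[j] * x[j] for j in range(len(x)))
         for i in range(len(x))]
    total = sum(y)
    return [v / total for v in y]

f = [1.0, 2.0]                  # type 1 replicates twice as fast
Q = [[0.9, 0.1], [0.1, 0.9]]    # symmetric mutation at rate 0.1
x = [0.5, 0.5]
for _ in range(50):
    x = quasispecies_step(x, f, Q)
# At mutation-selection balance the fitter type dominates.
```

A CMC model would run many such populations in parallel, with each genotype's fitness determined by the proteins it encodes under the current shared code.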
March 31, 2013
Computer-based simulation software built on numerical methods plays a major role in research in the natural and physical sciences. These tools allow scientists to tackle problems that are too large to solve analytically. Even so, they can fail to produce solutions due to computational or storage limits. However, as computer hardware performance improves, the computational limits recede as well. One such area of work is magnetic field modeling, which plays a crucial role in various fields of research, especially those related...
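A toy illustration of the kind of grid-based field computation involved (a Jacobi relaxation of Laplace's equation on a small 2D grid; real magnetic-field solvers use far larger grids, which is exactly where the compute and storage limits above bite):

```python
# Jacobi iteration for Laplace's equation on a 2D grid with fixed
# boundary values -- a minimal sketch of grid-based field modelling.
# Grid size and iteration count are arbitrary for this example.

def jacobi(grid, iters):
    n = len(grid)
    for _ in range(iters):
        new = [row[:] for row in grid]           # boundaries stay fixed
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                new[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j]
                                    + grid[i][j-1] + grid[i][j+1])
        grid = new
    return grid

n = 16
grid = [[0.0] * n for _ in range(n)]
grid[0] = [1.0] * n                 # fixed potential on the top edge
field = jacobi(grid, 500)
# Interior values interpolate smoothly between the boundaries.
```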
There are a number of design decisions that impact a GPU's performance. Among them, choosing the right warp size can deeply influence the rest of the design. Small warps reduce the performance penalty associated with branch divergence at the expense of reduced memory coalescing. Large warps enhance memory coalescing significantly but also increase branch divergence. This leaves designers with two choices: use small warps and invest in new solutions to enhance coalescing, or use large warps and address branch divergence with effective control-flow solutions. In this...
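The branch-divergence side of this trade-off can be sketched with a back-of-envelope cost model: a warp whose threads take both sides of a branch executes each side serially, so its cost grows with the number of distinct paths inside the warp. The model below is a deliberate simplification, not a real GPU:

```python
# Simplified divergence cost model: a warp occupies
# warp_size * (number of distinct branch paths taken inside the warp)
# execution slots, because divergent paths are serialized.

def execution_slots(branch_taken, warp_size):
    slots = 0
    for w in range(0, len(branch_taken), warp_size):
        warp = branch_taken[w:w + warp_size]
        slots += warp_size * len(set(warp))   # serialized passes
    return slots

outcomes = [True, True, False, False, True, False, True, False]
wide = execution_slots(outcomes, 8)    # one large warp: both paths
narrow = execution_slots(outcomes, 2)  # small warps: some stay uniform
```

With these outcomes a single 8-wide warp spends 16 slots on 8 threads, while 2-wide warps spend only 12; the flip side (lost coalescing for small warps) is a memory-system effect this toy model does not capture.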
Graphics Processing Unit Acceleration of the Explicit Solution of the Time Domain Volume Integral Equation Using OpenACC
A graphics processing unit (GPU) accelerated implementation of the explicit solution of the time domain volume integral equation (TD-VIE) using the OpenACC application program interface (API) is presented. The OpenACC API, which is based on a collection of compiler directives, allows the TD-VIE solver to be ported to GPUs with little effort while computing efficiently, achieving up to 50X speedups over a single-core serial implementation on a CPU.
Parallel Simulation of Population Balance Model-Based Particulate Processes Using Multicore CPUs and GPUs
Computer-aided modeling and simulation are a crucial step in developing, integrating, and optimizing unit operations, and subsequently entire processes, in the chemical/pharmaceutical industry. This study details two methods of reducing the computational time needed to solve complex process models, namely the population balance model, which, depending on its source terms, can be very computationally intensive. Population balance models are also widely used to describe the time evolution and size distributions of many particulate processes, and their efficient and quick simulation would be very beneficial. The...
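To make the cost concrete: even the simplest aggregation-only population balance (discrete Smoluchowski equations with a constant kernel, sketched below with explicit Euler stepping) is quadratic in the number of size classes per time step, which is why these models parallelize so profitably. The kernel and step size here are arbitrary; a standard sanity check is that total particle mass is conserved:

```python
# Discrete Smoluchowski aggregation with a constant kernel K, stepped
# with explicit Euler. Restricting events to i + j <= kmax keeps total
# mass exactly conserved, a standard PBM solver sanity check.

def aggregate(N, K, dt, steps):
    kmax = len(N)                  # N[k-1] = count of size-k particles
    for _ in range(steps):
        dN = [0.0] * kmax
        for i in range(1, kmax + 1):          # ordered pairs (i, j)
            for j in range(1, kmax + 1):
                if i + j <= kmax:
                    rate = K * N[i-1] * N[j-1]
                    dN[i+j-1] += 0.5 * rate   # birth of size i + j
                    dN[i-1]  -= 0.5 * rate    # both parents consumed
                    dN[j-1]  -= 0.5 * rate
        N = [n + dt * d for n, d in zip(N, dN)]
    return N

N0 = [100.0] + [0.0] * 9           # monomers only, sizes 1..10
N = aggregate(N0, K=1e-4, dt=0.1, steps=200)
mass = sum((k + 1) * n for k, n in enumerate(N))
```

The double loop over size classes inside every time step is the hot spot the paper's CPU and GPU parallelizations target.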
Data analysis is a rising field of interest for computer science research due to the growing amount of information that is digitally available. This increase in data directly makes any analysis significantly more complex. Using structured representations for the data sets, such as graphs, makes the analysis feasible, but it remains time-consuming. This project focuses on reducing the computational time of data analysis by introducing accelerators: specialized hardware components that assist the general processing unit in performing...
Open Compute Language (OpenCL) has been proposed as a platform-independent, parallel execution model to target heterogeneous systems, including multiple central processing units, graphics processing units (GPUs), and digital signal processors (DSPs). OpenCL parallelism scales with the available resources and hardware generational improvements due to the data-parallel nature of its kernels. Such parallel expressions must adhere to a rigid execution model, essentially forcing the run-time system to behave as a batch scheduler for small, local workgroups of a larger global problem. In many...
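The workgroup decomposition mentioned above can be mimicked in a few lines: OpenCL launches a global NDRange that the runtime splits into fixed-size local workgroups, each scheduled as an independent batch. A pure-Python sketch of the 1D index queries a kernel would make:

```python
# Sketch of OpenCL's 1D NDRange decomposition: every global work-item
# id maps to a (workgroup id, local id) pair, and workgroups are the
# units the runtime batch-schedules onto compute units.

def ndrange_1d(global_size, local_size):
    assert global_size % local_size == 0   # required in OpenCL 1.x
    for gid in range(global_size):
        group_id = gid // local_size       # cf. get_group_id(0)
        local_id = gid % local_size        # cf. get_local_id(0)
        yield gid, group_id, local_id

ids = list(ndrange_1d(global_size=8, local_size=4))
# Work-item 5 sits at position 1 of workgroup 1.
```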
We explore efficient parallel radix sort for the AMD Fusion Accelerated Processing Unit (APU). Two challenges arise: efficiently partitioning data between the CPU and GPU, and allocating data across memory regions. Our coarse-grained implementation utilizes both the GPU and CPU by sharing data at the beginning and end of the sort. Our fine-grained implementation utilizes the APU's integrated memory system to share data throughout the sort. Both implementations outperform the current state-of-the-art GPU radix sort from NVIDIA. We therefore demonstrate that the CPU can be efficiently...
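For reference, the serial form of the algorithm both implementations partition is a least-significant-digit radix sort: each pass is a stable counting sort on one byte of the key. A minimal Python sketch (the parallel versions split the histogram and scatter phases between CPU and GPU):

```python
# Reference serial LSD radix sort, base 256, for non-negative integers.
# Each pass stably redistributes keys by one byte, low byte first.

def radix_sort(keys, key_bytes=4):
    for shift in range(0, 8 * key_bytes, 8):
        buckets = [[] for _ in range(256)]
        for k in keys:
            buckets[(k >> shift) & 0xFF].append(k)   # histogram+scatter
        keys = [k for b in buckets for k in b]       # stable gather
    return keys

data = [170, 45, 75, 90, 802, 24, 2, 66]
out = radix_sort(data)   # -> [2, 24, 45, 66, 75, 90, 170, 802]
```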
K-means clustering and GMM training, as dictionary learning procedures, lie at the heart of many signal processing applications. Increasing data scale requires more efficient ways to perform this process. In this paper, new GPU- and multi-core-CPU-accelerated k-means clustering and GMM training methods are proposed. We show that both methods can be concisely reformulated into matrix multiplications, which allows the application of NVIDIA's CUDA-implemented Basic Linear Algebra Subprograms (cuBLAS) and the AMD Core Math Library (ACML), which provide highly optimized matrix...
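The reformulation the abstract refers to (for the k-means assignment step) expands each pairwise squared distance as d²(xᵢ, cⱼ) = ‖xᵢ‖² + ‖cⱼ‖² − 2·(XCᵀ)ᵢⱼ, so the bulk of the work becomes one matrix multiplication that an optimized GEMM can handle. A pure-Python sketch for clarity (the paper routes the `matmul` call through cuBLAS/ACML):

```python
# Pairwise squared distances via a matrix multiply: the k-means
# assignment step reduces to the GEMM X @ C^T plus cheap norm terms.

def matmul(A, B):                      # stand-in for an optimized GEMM
    return [[sum(a * b for a, b in zip(row, col))
             for col in zip(*B)] for row in A]

def pairwise_sq_dists(X, C):
    xn = [sum(v * v for v in x) for x in X]      # ||x_i||^2
    cn = [sum(v * v for v in c) for c in C]      # ||c_j||^2
    Ct = [list(col) for col in zip(*C)]          # C transposed
    G = matmul(X, Ct)                            # the GEMM hot spot
    return [[xn[i] + cn[j] - 2 * G[i][j]
             for j in range(len(C))] for i in range(len(X))]

X = [[0.0, 0.0], [3.0, 4.0]]   # two points
C = [[0.0, 0.0], [3.0, 0.0]]   # two centroids
d2 = pairwise_sq_dists(X, C)   # -> [[0.0, 9.0], [25.0, 16.0]]
```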
Computational science has emerged as a third pillar of science alongside theory and experiment, with parallel scientific computing supported by various shared- and distributed-memory architectures such as supercomputer systems, grid- and cluster-based systems, and multi-core and multiprocessor systems. In recent years, the use of GPUs (graphics processing units) for general-purpose computing, commonly known as GPGPU, has made them an exciting addition to high performance computing (HPC) systems with respect to the price-performance ratio. Current GPUs consist of several hundred...
We study the performance portability of OpenCL across diverse architectures including NVIDIA GPU, Intel Ivy Bridge CPU, and AMD Fusion APU. We present detailed performance analysis at assembly level on three exemplar OpenCL benchmarks: SGEMM, SpMV, and FFT. We also identify a number of tuning knobs that are critical to performance portability, including threads-data mapping, data layout, tiling size, data caching, and operation-specific factors. We further demonstrate that proper tuning could improve the OpenCL portable performance from the current 15% to a potential 67% of the...
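Tiling size, one of the tuning knobs named above, is a good example of a parameter that changes how a computation is blocked for the memory hierarchy without changing its result. A minimal tiled matrix multiply, parameterized by tile size, as an illustration (the benchmarks in the paper tune the analogous GPU kernel parameter):

```python
# Tiled (blocked) matrix multiply: the tile size controls locality and
# is a classic auto-tuning knob; every tile size computes the same
# product, only the traversal order differs.

def tiled_matmul(A, B, tile):
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, m, tile):
            for jj in range(0, p, tile):
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, m)):
                        a = A[i][k]
                        for j in range(jj, min(jj + tile, p)):
                            C[i][j] += a * B[k][j]
    return C

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
# tiled_matmul(A, B, 1) == tiled_matmul(A, B, 2): the knob only
# changes scheduling, which is what makes it safely tunable.
```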
PURPOSE: CT reconstruction algorithms implemented on the GPU are highly sensitive to their implementation details and the hardware they run on. Fine-tuning an implementation for optimal performance can be a time-consuming task and requires many updates when the hardware changes. Some techniques do automatic fine-tuning of GPU code; these techniques, however, are relatively narrow in scope and are often based on heuristics, which can be inaccurate. The goal of this paper is to present a framework that will automate the process of code optimization with maximum...
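The heuristic-free alternative such frameworks pursue can be sketched in its simplest form: time every candidate configuration on the actual hardware and keep the fastest. The tunable below (the tile size of a blocked transpose) is an illustrative stand-in for real GPU kernel parameters, not the paper's framework:

```python
# Minimal exhaustive auto-tuner: benchmark each candidate configuration
# and return the fastest. No heuristics -- every point in the search
# space is measured on the machine the code will actually run on.
import time

def blocked_transpose(M, tile):
    n = len(M)
    T = [[0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for i in range(ii, min(ii + tile, n)):
                for j in range(jj, min(jj + tile, n)):
                    T[j][i] = M[i][j]
    return T

def autotune(fn, candidates, *args):
    best, best_t = None, float("inf")
    for cfg in candidates:
        t0 = time.perf_counter()
        fn(*args, cfg)
        elapsed = time.perf_counter() - t0
        if elapsed < best_t:
            best, best_t = cfg, elapsed
    return best

M = [[i * 64 + j for j in range(64)] for i in range(64)]
tiles = [4, 8, 16, 32]
best_tile = autotune(blocked_transpose, tiles, M)
```

Real frameworks prune this exhaustive search; the point of the sketch is that measurement, not a heuristic model, decides the winner.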
Most viewed papers (last 30 days)
- Graphics Programming on the Web WebCL Course Notes
- Simulating the universe with GPU-accelerated supercomputers: n-body methods, tests, and examples
- Secrets from the GPU
- Implementations of the FFT algorithm on GPU
- Fluid Motion Modelling Using Vortex Particle Method on GPU
- Adding GPU Computing to Computer Organization Courses
- libWater: Heterogeneous Distributed Computing Made Easy
- Fast Implementation of Scale Invariant Feature Transform Based on CUDA
- Faster Upper Body Pose Estimation and Recognition Using CUDA
- Analyzing Locality of Memory References in GPU Architectures
Optimizing a Biomedical Imaging Orientation Score Framework
Graphics Programming on the Web WebCL Course Notes
Adaptive Dynamic Load Balancing in Heterogeneous Multiple GPUs-CPUs Distributed Setting: Case Study of B&B Tree Search
Duality based optical flow algorithms with applications
In-Place Recursive Approach for All-Pairs Shortest Paths Problem Using OpenCL
A parallel decoding algorithm of LDPC codes using CUDA
Optimizing MapReduce for GPUs with effective shared memory usage
OpenCL parallel Processing using General Purpose Graphical Processing units - TiViPE software development
Kernelet: High-Throughput GPU Kernel Executions with Dynamic Slicing and Scheduling
Stencil-Aware GPU Optimization of Iterative Solvers
Registered users can now run their OpenCL applications at hgpu.org. We provide 1 minute of computer time per run on each of two nodes: one with two AMD GPUs, and one with an AMD and an NVIDIA GPU. There are no restrictions on the number of runs.
The platforms are:
Node 1:
- GPU device 0: AMD/ATI Radeon HD 5870, 2GB, 850MHz
- GPU device 1: AMD/ATI Radeon HD 6970, 2GB, 880MHz
- CPU: AMD Phenom II X6 1055T @ 2.8GHz
- RAM: 12GB
- HDD: 2TB, RAID-0
- OS: openSUSE 11.4
- SDK: AMD APP SDK 2.8
Node 2:
- GPU device 0: AMD/ATI Radeon HD 7970, 3GB, 1000MHz
- GPU device 1: NVIDIA GeForce GTX 560 Ti, 2GB, 822MHz
- CPU: Intel Core i7-2600 @ 3.4GHz
- RAM: 16GB
- HDD: 2TB, RAID-0
- OS: openSUSE 12.2
- SDK: NVIDIA CUDA Toolkit 5.0.35, AMD APP SDK 2.8
Completed OpenCL projects should be uploaded via the User dashboard (see the instructions and example there); compilation and execution terminal output logs will be provided to the user.