The most recent entries
Air pollution is one of the major problems the world faces today. It is caused by the release of dangerous chemical substances such as carbon monoxide, chlorofluorocarbons (CFCs), carbon dioxide, hydrocarbons, and sulfur dioxide into the atmosphere. These substances are produced by various anthropogenic activities such as vehicle use and factory operations. Air quality must be assessed to prevent the ill effects of pollutants on the environment. Air Quality Modeling (AQM) is an attempt to predict or simulate the ambient concentrations of contaminants in...
Enabling HPC applications to perform efficiently when invoking multiple parallel libraries simultaneously is a great challenge. Even if a single runtime system is used underneath, scheduling tasks or threads coming from different libraries over the same set of hardware resources introduces many issues, such as resource oversubscription, undesirable cache flushes or memory bus contention. This paper presents an extension of StarPU, a runtime system specifically designed for heterogeneous architectures, that allows multiple parallel codes to run concurrently with minimal interference. Such...
Sequential Consistency for Heterogeneous-Race-Free: Programmer-centric Memory Models for Heterogeneous Platforms
Hardware vendors now provide heterogeneous platforms in commodity markets (e.g., integrated CPUs and GPUs), and are promising an integrated, shared memory address space for such platforms in future iterations. Because not all threads in a heterogeneous platform can communicate with the same latency, vendors are proposing synchronization mechanisms that allow threads to communicate with a subset of threads (called a scope). However, vendors have yet to define a comprehensive and portable memory model that programmers can use to reason about scopes. Moreover, existing CPU memory models, such as...
In this paper we describe a parallel implicit method based on radial basis functions (RBF) for surface reconstruction. The applicability of RBF methods is hindered by their computational demand, which requires the solution of linear systems of size equal to the number of data points. Our reconstruction implementation relies on parallel scientific libraries and supports massively multi-core architectures, namely Graphics Processing Units (GPUs). The performance of the proposed method in terms of accuracy of the reconstruction and computing time shows that the RBF interpolant can be very...
Significant new challenges are continuously confronting the High Energy Physics (HEP) experiments, in particular the two detectors at the Large Hadron Collider (LHC) at CERN, where nominal conditions deliver proton-proton collisions to the detectors at a rate of 40 MHz. This rate must be significantly reduced to comply with both the performance limitations of the mass storage hardware and the capabilities of the computing resources to process the collected data in a timely fashion for physics analysis. At the same time, the physics signals of interest must be retained with high efficiency....
The gap between a supercomputer's theoretical maximum ("peak") floating-point performance and that actually achieved by applications has grown wider over time. Today, a typical scientific application achieves only 5-20% of any given machine's peak processing capability, and this gap leaves room for significant improvements in execution times. This problem is most pronounced for modern "accelerator" architectures -- collections of hundreds of simple, low-clocked cores capable of executing the same instruction on dozens of pieces of data simultaneously. This is a significant...
Implementing Continuous Integration Software in an Established Computational Chemistry Software Package
Continuous integration is the software engineering principle of rapid and automated development and testing. We identify several key points of continuous integration and demonstrate how they relate to the needs of computational science projects by discussing the implementation and relevance of these principles to AMBER, a large and widely used molecular dynamics software package. The use of a continuous integration server has improved collaboration and communication between AMBER developers, who are globally distributed, and has made failure and benchmark information that would be...
This paper reports on the development of an MPI/OpenCL implementation of LU, an application-level benchmark from the NAS Parallel Benchmark Suite. An account of the design decisions addressed during the development of this code is presented, demonstrating the importance of memory arrangement and work-item/work-group distribution strategies when applications are deployed on different device types. The resulting platform-agnostic, single source application is benchmarked on a number of different architectures, and is shown to be 1.3-1.5x slower than native FORTRAN or CUDA implementations on a...
Super Earths and Dynamical Stability of Planetary Systems: First Parallel GPU Simulations Using GENGA
We report on the stability of hypothetical Super-Earths in the habitable zone of known multi-planetary systems. Most of them have not yet been studied in detail concerning the existence of additional low-mass planets. The new N-body code GENGA developed at the UZH allows us to perform numerous N-body simulations in parallel on GPUs. With this numerical tool, we can study the stability of orbits of hypothetical planets in the semi-major axis and eccentricity parameter space in high resolution. Massless test particle simulations give good predictions on the extension of the stable region and...
Modern computers have graphics cards with much higher theoretical efficiency than conventional CPUs. The paper presents possible applications of GPU CUDA acceleration to data encryption, using a new architecture tailored to the 3DES algorithm, which offers increased security compared to plain DES. The algorithm is used in ECB (Electronic Codebook) mode, in which 64-bit data blocks are encrypted independently by stream processors (CUDA cores).
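The property the abstract relies on is that ECB mode has no dependency between blocks, so each 64-bit block can be handed to a separate worker. The following is a minimal sketch of that idea only, not the paper's implementation: `toy_encrypt_block` is a hypothetical stand-in (a plain XOR, neither DES nor 3DES, and not secure), and a Python thread pool stands in for the CUDA cores.

```python
from concurrent.futures import ThreadPoolExecutor

BLOCK_BYTES = 8  # 64-bit blocks, the block size used by DES and 3DES


def toy_encrypt_block(block: bytes, key: bytes) -> bytes:
    """Hypothetical placeholder for a 64-bit block cipher (NOT DES, NOT secure)."""
    return bytes(b ^ k for b, k in zip(block, key))


def ecb_encrypt(data: bytes, key: bytes, workers: int = 4) -> bytes:
    """Encrypt each block independently, as ECB mode prescribes."""
    if len(key) != BLOCK_BYTES:
        raise ValueError("key must be 8 bytes long")
    if len(data) % BLOCK_BYTES != 0:
        raise ValueError("data length must be a multiple of 8 bytes")
    # Split the input into independent 8-byte blocks.
    blocks = [data[i:i + BLOCK_BYTES] for i in range(0, len(data), BLOCK_BYTES)]
    # No block depends on any other, so the map can run in any order;
    # pool.map still returns the results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        encrypted = list(pool.map(lambda b: toy_encrypt_block(b, key), blocks))
    return b"".join(encrypted)


if __name__ == "__main__":
    key = bytes(range(1, 9))
    message = b"ABCDEFGH" * 2 + b"12345678"
    print(ecb_encrypt(message, key).hex())
```

The same independence is what makes ECB leak structure: identical plaintext blocks always produce identical ciphertext blocks under the same key.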
In this study, we broadly investigate the problem of string matching in the context of Heterogeneous Parallel Computing. An overview of string matching is given, in which the different forms of the string matching problem are distinguished and the classifications of string matching algorithms are discussed. As an alternative to grep for computationally intensive string matching, and to support the research in this study, a parallel exact string matching utility "Clgrep" is developed. Through experimental studies, we investigate the use of heuristics-based algorithms, specifically QS and...
May 20, 2013
In this paper we study a parallel form of the SOR method for the numerical solution of the convection-diffusion equation, suitable for GPUs using CUDA. To exploit the parallelism offered by GPUs we consider the fine-grained parallelism model. This is achieved by considering the local relaxation version of SOR. More specifically, we use SOR with red-black ordering with two sets of parameters omega_ij and omega'_ij. The parameter omega_ij is associated with each red (i+j even) grid point (ij), whereas the parameter omega'_ij is associated with each black (i+j odd) grid point (ij). The use of a...
May 20, 2013
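The red-black scheme with per-point relaxation parameters described in the abstract above can be sketched as follows. This is an illustration under stated assumptions, not the paper's method: it solves the simpler Poisson problem with a 5-point stencil instead of the convection-diffusion equation, runs serially in plain Python instead of CUDA, and uses constant values for omega_ij and omega'_ij because the paper's formulas for them are not given in the abstract. All function and variable names are hypothetical.

```python
def red_black_sor(u, f, h, omega_red, omega_black, sweeps):
    """In-place red-black SOR for -Laplace(u) = f on a square grid.

    u           : n x n list of lists; boundary entries hold fixed values
    f           : n x n right-hand side
    h           : grid spacing
    omega_red   : n x n relaxation parameters used at red points (i+j even)
    omega_black : n x n relaxation parameters used at black points (i+j odd)
    """
    n = len(u)
    for _ in range(sweeps):
        # Update all red points first, then all black points. Points of one
        # color only read neighbors of the other color, so every update
        # within a half-sweep is independent and could run in parallel.
        for parity, omega in ((0, omega_red), (1, omega_black)):
            for i in range(1, n - 1):
                for j in range(1, n - 1):
                    if (i + j) % 2 != parity:
                        continue
                    gauss_seidel = 0.25 * (
                        u[i - 1][j] + u[i + 1][j]
                        + u[i][j - 1] + u[i][j + 1]
                        + h * h * f[i][j]
                    )
                    # Local relaxation: each point uses its own omega.
                    u[i][j] += omega[i][j] * (gauss_seidel - u[i][j])
    return u


if __name__ == "__main__":
    n = 9
    # Boundary fixed at 1, interior starts at 0, zero right-hand side:
    # the exact solution is u = 1 everywhere.
    u = [[1.0 if i in (0, n - 1) or j in (0, n - 1) else 0.0
          for j in range(n)] for i in range(n)]
    f = [[0.0] * n for _ in range(n)]
    omega = [[1.5] * n for _ in range(n)]  # assumed constant, for illustration
    red_black_sor(u, f, 1.0 / (n - 1), omega, omega, sweeps=200)
    print(max(abs(u[i][j] - 1.0) for i in range(n) for j in range(n)))
```

On a GPU, each half-sweep would typically map to one kernel launch with one thread per grid point of the active color.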
Most viewed papers (last 30 days)
- Graphics Programming on the Web WebCL Course Notes
- Simulating the universe with GPU-accelerated supercomputers: n-body methods, tests, and examples
- Secrets from the GPU
- Implementations of the FFT algorithm on GPU
- Fluid Motion Modelling Using Vortex Particle Method on GPU
- Adding GPU Computing to Computer Organization Courses
- libWater: Heterogeneous Distributed Computing Made Easy
- Fast Implementation of Scale Invariant Feature Transform Based on CUDA
- Faster Upper Body Pose Estimation and Recognition Using CUDA
- Analyzing Locality of Memory References in GPU Architectures
Optimizing a Biomedical Imaging Orientation Score Framework
Graphics Programming on the Web WebCL Course Notes
Adaptive Dynamic Load Balancing in Heterogeneous Multiple GPUs-CPUs Distributed Setting: Case Study of B&B Tree Search
Duality based optical flow algorithms with applications
A parallel decoding algorithm of LDPC codes using CUDA
Optimizing MapReduce for GPUs with effective shared memory usage
OpenCL parallel Processing using General Purpose Graphical Processing units - TiViPE software development
Kernelet: High-Throughput GPU Kernel Executions with Dynamic Slicing and Scheduling
Stencil-Aware GPU Optimization of Iterative Solvers
A General-Purpose GPU Reservoir Computer
October 1-4, 2013
November 13-15, 2013
February 2-6, 2014
San Francisco, USA
February 12-14, 2014
November 11-14, 2013
San Jose, California, USA
Registered users can now run their OpenCL applications at hgpu.org. We provide 1 minute of compute time per run on two nodes: the first has two AMD graphics processing units, and the second has one AMD and one nVidia graphics processing unit. There is no limit on the number of runs.
The platforms are:
Node 1
- GPU device 0: AMD/ATI Radeon HD 5870 2GB, 850MHz
- GPU device 1: AMD/ATI Radeon HD 6970 2GB, 880MHz
- CPU: AMD Phenom II X6 1055T @ 2.8GHz
- RAM: 12GB
- HDD: 2TB, RAID-0
- OS: OpenSUSE 11.4
- SDK: AMD APP SDK 2.8
Node 2
- GPU device 0: AMD/ATI Radeon HD 7970 3GB, 1000MHz
- GPU device 1: nVidia GeForce GTX 560 Ti 2GB, 822MHz
- CPU: Intel Core i7-2600 @ 3.4GHz
- RAM: 16GB
- HDD: 2TB, RAID-0
- OS: OpenSUSE 12.2
- SDK: nVidia CUDA Toolkit 5.0.35, AMD APP SDK 2.8
A completed OpenCL project should be uploaded via the User dashboard (see the instructions and example there). The terminal output logs from compilation and execution will be provided to the user.