The most recent entries
Analysis of Multicore CPU and GPU Toward Parallelization of Total Focusing Method Ultrasound Reconstruction
Ultrasonic imaging and reconstruction tools are commonly used to detect, identify and measure defects in different mechanical parts. Due to the complexity of the underlying physics, and due to the ever-growing quantity of acquired data, computation time is becoming a limitation to the optimal inspection of a mechanical part. This article presents the performance of several implementations of a computationally heavy algorithm, named Total Focusing Method, on both graphics processing units (GPU) and general purpose processors (GPP). The scope of this study is narrowed to planar parts tested...
This paper presents a novel local contrast enhancement algorithm based on local histogram modification. The computation of local contrast enhancement operators is usually slow though they produce better local contrast and details. We have addressed this issue by subtly designing a highly parallel algorithm, which could be easily implemented on Graphics Processing Units (GPU) to harvest high computational efficiency. Our method is fast and easy to use, and the experiment results show that the technique can produce good results on a variety of images.
Tiling is a key technique to enhance data reuse. For computations structured as one sequential outer "time" loop enclosing a set of parallel inner loops, tiling only the parallel inner loops may not enable enough data reuse in the cache. Tiling the inner loops along with the outer time loop enhances data locality but may require other transformations like loop skewing that inhibit inter-tile parallelism. One approach to tiling that enhances data locality without inhibiting inter-tile parallelism is split tiling, where tiles are subdivided into a sequence of trapezoidal computation...
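As a rough illustration of the baseline case the abstract above describes (tiling only the parallel inner loop under a sequential time loop), here is a minimal Python sketch for a 1D three-point stencil. It is not the split tiling scheme of the paper; the function names, the stencil, and the tile size are all assumptions chosen for illustration.

```python
def stencil_untiled(u, steps):
    """Sequential outer time loop around one parallel inner loop."""
    for _ in range(steps):
        new = u[:]  # boundary values are carried over unchanged
        for i in range(1, len(u) - 1):
            new[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
        u = new
    return u


def stencil_inner_tiled(u, steps, tile=4):
    """Same computation with only the inner (space) loop split into tiles.

    The time loop is left untouched, so each tile is still revisited once
    per time step; this is the limited-reuse situation the abstract mentions.
    """
    for _ in range(steps):
        new = u[:]
        for start in range(1, len(u) - 1, tile):
            stop = min(start + tile, len(u) - 1)
            # Tiles within one time step are independent of each other.
            for i in range(start, stop):
                new[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
        u = new
    return u
```

Both functions perform the same arithmetic in the same order per point, so they return identical values; tiling here changes only the traversal order within a time step.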
In this paper, we present an image object tracking system based on a GPGPU implementation of the CAMshift algorithm. For image object tracking, we use the parallel CAMshift tracking algorithm based on the HSV color image distribution of detected moving objects. In this system, RGB-to-HSV color conversion, image masking such as the open and close operations of image morphology, and centroid computation are executed in parallel. The CAMshift algorithm is very efficient for real-time tracking because of its fast and robust performance. In this system, the CUDA environment and a C++ program are used for image processing and accessing the...
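Two of the per-pixel steps named in the abstract above, RGB-to-HSV conversion and centroid computation, can be sketched sequentially in Python as follows. This is only an illustration of the arithmetic, not the paper's CUDA implementation; `hue_mask`, `centroid`, and the hue and saturation thresholds are hypothetical names and values. The conversion uses the standard library's `colorsys.rgb_to_hsv`, and the centroid is taken from the zeroth and first image moments of a binary mask.

```python
import colorsys


def hue_mask(image, hue_lo, hue_hi, min_sat=0.5):
    """Return a 0/1 mask of pixels whose hue lies in [hue_lo, hue_hi].

    `image` is a list of rows of (r, g, b) tuples with 8-bit channels.
    Each pixel is handled independently, which is what makes this step
    a natural fit for one GPU thread per pixel.
    """
    mask = []
    for row in image:
        mask_row = []
        for r, g, b in row:
            h, s, _v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
            mask_row.append(1 if hue_lo <= h <= hue_hi and s >= min_sat else 0)
        mask.append(mask_row)
    return mask


def centroid(mask):
    """Centroid (x, y) of a binary mask from moments M00, M10, M01."""
    m00 = m10 = m01 = 0
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            m00 += value
            m10 += x * value
            m01 += y * value
    if m00 == 0:
        return None  # no pixel matched
    return (m10 / m00, m01 / m00)
```

On a GPU the moment sums would typically be computed with a parallel reduction rather than the nested loops shown here.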
This research aims to accelerate the computation of the Kleene star in max-plus algebra using CUDA technology on graphics processing units (GPUs). The target module is the Kleene star of a weighted adjacency matrix for directed acyclic graphs (DAGs), which plays an essential role in calculating the earliest and/or latest schedule for a class of discrete event systems. In recent NVIDIA GPU cards, an environment for high performance computing is provided to general developers, for which we aim to exploit the benefit of using GPUs. Using an NVIDIA Tesla C2075 for our experiments, we obtained...
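For readers unfamiliar with the operation in the abstract above: in max-plus algebra, addition is `max` and multiplication is `+`, and the Kleene star of a matrix A is the max-plus sum of its powers, E ⊕ A ⊕ A² ⊕ ..., where E is the max-plus identity. For the adjacency matrix of a DAG with n nodes, every power from Aⁿ onward contains only -∞, so the sum can stop at Aⁿ⁻¹. The following plain Python sketch computes it directly from that definition; it is a sequential reference, not the paper's CUDA method, and the function names and the convention that `a[i][j]` holds the weight of the arc from node j to node i are assumptions.

```python
NEG_INF = float("-inf")  # max-plus "zero": absence of an arc


def maxplus_identity(n):
    """Max-plus identity: 0 on the diagonal, -inf elsewhere."""
    return [[0.0 if i == j else NEG_INF for j in range(n)] for i in range(n)]


def maxplus_matmul(a, b):
    """Max-plus product: (a ⊗ b)[i][j] = max over k of a[i][k] + b[k][j]."""
    inner = len(b)
    return [
        [max(a[i][k] + b[k][j] for k in range(inner)) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def kleene_star(a):
    """E ⊕ A ⊕ ... ⊕ A^(n-1); valid when `a` is the matrix of a DAG."""
    n = len(a)
    result = maxplus_identity(n)
    power = maxplus_identity(n)
    for _ in range(n - 1):
        power = maxplus_matmul(power, a)
        # Element-wise max is the max-plus sum of two matrices.
        result = [
            [max(result[i][j], power[i][j]) for j in range(n)] for i in range(n)
        ]
    return result
```

Entry `[i][j]` of the result is the weight of the heaviest path from node j to node i, which is why the operation appears in earliest-schedule calculations.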
As graphics processing units (GPUs) are continually being utilized as coprocessors, the demand for optimally utilizing them for various applications continues to grow. This work narrows the gap between programmers and minimum execution time for matrix-based computations on a GPU. To minimize execution time, computation and communication time must be considered. For computation, the placement of data in GPU memory significantly affects computation time and therefore is considered. Various matrix-based computation patterns are examined with respect to the layout of GPU memory. A computation...
In this paper, we present techniques that coordinate the thread scheduling and prefetching decisions in a General Purpose Graphics Processing Unit (GPGPU) architecture to better tolerate long memory latencies. We demonstrate that existing warp scheduling policies in GPGPU architectures are unable to effectively incorporate data prefetching. The main reason is that they schedule consecutive warps, which are likely to access nearby cache blocks and thus prefetch accurately for one another, back-to-back in consecutive cycles. This either 1) causes prefetches to be generated by a warp too close...
Multimedia applications are present in most mobile hand-held devices. The H.264 standard is currently dominating the video compression world. H.264 has high computational complexity, requiring a large amount of processing resources. Many techniques have emerged that optimize H.264 using parallelization on multicore systems, ranging from groups of pictures down to the smallest block of pixels. We propose a parallelization technique based on rows of macroblocks with a light dependency detection algorithm that optimizes data parallelization and minimizes dependency synchronization stall time. The parallel...
The volume of bank data calculations is increasing each year at an extraordinary scale, and with that, new forms of computation are needed. High performance computing is a very attractive field for optimizing such bank calculations and can give promising results. This paper shows an implementation of a known model for assessing the credit risk of a company. To get the most accurate price and speedup comparison, this method was implemented in both CPU and GPU versions. The GPU version was built using the CUDA architecture and shows some reasons for and advantages of using such GPU computing for...
In recent years, the computational power of modern processors has been increasing mainly because of the increase in the number of processor cores. Computationally intensive applications can gain from this trend only if they employ parallelism, such as thread-level parallelization. Geometric simulations can employ thread-level parallelization because the main part of a geometric simulation can be divided into a subset of mutually independent tasks. This approach is especially interesting for acoustic beam tracing because it is an intensive computing task. This paper presents the...
A modified parallel variable distribution (PVD) algorithm for solving large-scale constrained optimization problems is developed, which modifies quadratic subproblem QPl at each iteration instead of the QPl of the SQP-type PVD algorithm proposed by C. A. Sagastizabal and M. V. Solodov in 2002. The algorithm can circumvent the difficulties associated with the possible inconsistency of the subproblem of the original SQP method. Moreover, we introduce a nonmonotone technique instead of the penalty function to carry out the line search procedure more flexibly. Under appropriate conditions, the...
Encryption of real-time multimedia data transfers is one of the tasks for telecommunication infrastructure that should be considered in order to reach an essential level of security. The execution time of the ciphering algorithm can play a fundamental role in packet delay and therefore provides an interesting challenge in terms of optimization methods. This work focuses on the possibilities of parallelizing SRTP processing for a private gateway using the OpenCL framework, on the utilization of the gateway's resources, and on the analysis of potential improvement.
Most viewed papers (last 30 days)
- Graphics Programming on the Web WebCL Course Notes
- Simulating the universe with GPU-accelerated supercomputers: n-body methods, tests, and examples
- Secrets from the GPU
- Implementations of the FFT algorithm on GPU
- Fluid Motion Modelling Using Vortex Particle Method on GPU
- Adding GPU Computing to Computer Organization Courses
- libWater: Heterogeneous Distributed Computing Made Easy
- Fast Implementation of Scale Invariant Feature Transform Based on CUDA
- Faster Upper Body Pose Estimation and Recognition Using CUDA
- Analyzing Locality of Memory References in GPU Architectures
Optimizing a Biomedical Imaging Orientation Score Framework
Graphics Programming on the Web WebCL Course Notes
Adaptive Dynamic Load Balancing in Heterogeneous Multiple GPUs-CPUs Distributed Setting: Case Study of B&B Tree Search
Duality based optical flow algorithms with applications
In-Place Recursive Approach for All-Pairs Shortest Paths Problem Using OpenCL
A parallel decoding algorithm of LDPC codes using CUDA
Optimizing MapReduce for GPUs with effective shared memory usage
OpenCL parallel Processing using General Purpose Graphical Processing units - TiViPE software development
Kernelet: High-Throughput GPU Kernel Executions with Dynamic Slicing and Scheduling
Stencil-Aware GPU Optimization of Iterative Solvers
Registered users can now run their OpenCL applications at hgpu.org. We provide 1 minute of computer time per run on two nodes equipped with AMD and nVidia graphics processing units, as listed below. There are no restrictions on the number of runs.
The platforms are:
Node 1:
- GPU device 0: AMD/ATI Radeon HD 5870 2GB, 850MHz
- GPU device 1: AMD/ATI Radeon HD 6970 2GB, 880MHz
- CPU: AMD Phenom II X6 @ 2.8GHz 1055T
- RAM: 12GB
- HDD: 2TB, Raid-0
- OS: OpenSUSE 11.4
- SDK: AMD APP SDK 2.8
Node 2:
- GPU device 0: AMD/ATI Radeon HD 7970 3GB, 1000MHz
- GPU device 1: nVidia GeForce GTX 560 Ti 2GB, 822MHz
- CPU: Intel Core i7-2600 @ 3.4GHz
- RAM: 16GB
- HDD: 2TB, Raid-0
- OS: OpenSUSE 12.2
- SDK: nVidia CUDA Toolkit 5.0.35, AMD APP SDK 2.8
A completed OpenCL project should be uploaded via the User dashboard (see the instructions and example there). The terminal output logs from compilation and execution will be provided to the user.