Parallel paradigms in optimal structural design

Van Huyssteen, Salomon Stephanus

Department of Mechanical and Mechatronic Engineering, University of Stellenbosch, Private Bag X1, Matieland 7602, South Africa

University of Stellenbosch, 2011

@article{van2011parallel,
  title={Parallel paradigms in optimal structural design},
  author={Van Huyssteen, S.S.},
  year={2011},
  publisher={Stellenbosch: Stellenbosch University}
}

Modern-day processors are not getting any faster. Due to the power consumption limit of frequency scaling, parallel processing is increasingly being used to decrease computation time. In this thesis, several parallel paradigms are used to improve the performance of commonly serial SAO programs. Four novelties are discussed. First, double precision solvers are replaced with single precision solvers, in order to take advantage of the anticipated factor-of-2 speed increase of single precision over double precision computations. However, single precision routines present unpredictable performance characteristics and struggle to converge to the required accuracies, which is unfavourable for optimization solvers. Second, QP and dual statements are pitted against one another in a parallel environment, because it is not always clear a priori which of the two will perform best. Both are started in parallel, and the competing threads are cancelled as soon as one returns a valid point. Parallel QP vs. dual statements prove to be very attractive, converging within the minimum number of outer iterations, with the most appropriate solver being selected as the problem properties change during the iteration steps. Thread cancellation does pose problems: threads have to wait to arrive at appropriate checkpoints, and so suffer unnecessarily long wait times because of struggling competing routines. Third, multiple global searches are started in parallel on a shared memory system. A speed increase of nearly 4x is observed for all problems. Dynamically scheduled threads remove the need for a fixed number of threads, as is required in message passing implementations. Lastly, replacing existing matrix-vector multiplication routines with optimized BLAS routines, especially BLAS routines targeted at GPGPU (general-purpose computing on graphics processing units) technologies, proves to be superior when computing large matrix-vector products in an iterative environment. These problems scale well within the hardware capabilities, and speedups of up to 36x are recorded.
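
The second idea in the abstract, racing two solvers and cancelling the loser at a checkpoint, can be sketched roughly as follows. This is a minimal illustration in Python, not the thesis implementation: the names `iterative_solver` and `race` are hypothetical, and the "solvers" are stand-ins that only sleep to simulate inner iterations. It assumes cooperative cancellation, where a thread can stop only when it reaches a checkpoint.

```python
import queue
import threading
import time


def iterative_solver(name, steps, step_time, cancel, results):
    """Hypothetical stand-in for a QP or dual subproblem solver."""
    for _ in range(steps):
        # Checkpoint: the only place where this thread can be cancelled.
        if cancel.is_set():
            return
        time.sleep(step_time)  # simulate one inner iteration of work
    results.put(name)  # report a "valid point" by announcing completion


def race(solvers):
    """Start all solvers in parallel and return the name of the first to finish.

    `solvers` is a list of (name, steps, step_time) tuples (illustrative only).
    """
    cancel = threading.Event()
    results = queue.Queue()
    threads = [
        threading.Thread(
            target=iterative_solver,
            args=(name, steps, step_time, cancel, results),
        )
        for name, steps, step_time in solvers
    ]
    for t in threads:
        t.start()
    winner = results.get()  # block until the first solver returns
    cancel.set()            # ask the remaining solvers to stop
    for t in threads:
        t.join()            # wait for losers to reach their next checkpoint
    return winner


if __name__ == "__main__":
    print(race([("qp", 5, 0.01), ("dual", 50, 0.01)]))
```

The `join` loop is where the drawback mentioned in the abstract shows up: the winner cannot proceed until each losing thread reaches its next checkpoint, so a slow inner iteration in a struggling solver delays everyone.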

Tags: Computer science, CUDA, nVidia, nVidia GeForce GTX 280, OpenCL, Optimization, Thesis

December 18, 2011 by hgpu
