
Posts

Dec, 19

Study, Modelling and Implementation of the Level Set Method Used in Micromachining Processes

The main topic of this thesis is improving the simulation of fabrication processes by means of the Level Set (LS) method. The LS method is a mathematical approach for evolving fronts according to prescribed laws of motion. The main advantage of this method is that the front is embedded inside a higher […]

Dec, 19

Challenges Adapting CUDA PIC Codes to multiple GPUs

Particle-In-Cell (PIC) is a common particle simulation method often used to simulate the behaviour of plasma. In this work, a parallel PIC code is developed in CUDA, with a focus on how to adapt the method for multiple GPUs. An electrostatic three-dimensional PIC code is developed, with an FFT-based solver using the cuFFT […]
Dec, 19

Efficient Query Processing in Co-Processor-accelerated Databases

Advancements in hardware have shifted the bottleneck of modern database systems from disk IO to main memory access and processing power. Since the performance of modern processors is primarily limited by a fixed energy budget, hardware vendors are forced to specialize processors. Consequently, processors are becoming increasingly heterogeneous, a trend that has already become commodity in the form of accelerated […]
Dec, 15

Origami: A Convolutional Network Accelerator

Today, advanced computer vision (CV) systems of ever-increasing complexity are being deployed in a growing number of application scenarios with strong real-time and power constraints. Current trends in CV clearly show a rise of neural network-based algorithms, which have recently broken many object detection and localization records. These approaches are very flexible and can […]
Dec, 15

Adaptive algebraic multigrid on SIMD architectures

We present details of our implementation of the Wuppertal adaptive algebraic multigrid code DD-alpha AMG on SIMD architectures, with particular emphasis on the Intel Xeon Phi processor (KNC) used in QPACE 2. As a smoother, the algorithm uses a domain-decomposition-based solver code previously developed for the KNC in Regensburg. We optimized the remaining parts of […]
Dec, 15

A CUDA Kernel Scheduler Exploiting Static Data Dependencies

The CUDA execution model of Nvidia’s GPUs is based on the asynchronous execution of thread blocks, where each thread executes the same kernel in a data-parallel fashion. When threads in different thread blocks need to synchronise and communicate, the whole computation launched onto the GPU needs to be stopped and re-invoked in order to facilitate […]
Dec, 15

Run-time support for multi-level disjoint memory address spaces

High Performance Computing (HPC) systems have become widely used tools in many industry areas and research fields. Research to produce more powerful and efficient systems has grown in step with their popularity. As a consequence, the complexity of modern HPC architectures has increased in order to provide systems with the highest levels of performance. This […]
Dec, 15

Bigger Buffer k-d Trees on Multi-Many-Core Systems

A buffer k-d tree is a k-d tree variant for massively-parallel nearest neighbor search. While providing valuable speed-ups on modern many-core devices when large numbers of both reference and query points are given, buffer k-d trees are limited by the number of points that can fit on a single device. In this work, […]
Dec, 15

Compressed Dynamic Mode Decomposition for Real-Time Object Detection

We introduce the method of compressive dynamic mode decomposition (cDMD) for robustly performing real-time foreground/background separation in high-definition video. The DMD method provides a regression technique for least-squares fitting of video snapshots to a linear dynamical system. The method integrates two of the leading data analysis methods in use today: Fourier transforms and Principal Components. […]
Dec, 15

A Survey Of Techniques for Cache Locking

Cache memory, although important for boosting application performance, is also a source of execution time variability, which makes its use difficult in systems requiring worst-case execution time (WCET) guarantees. Cache locking is a promising approach for simplifying WCET estimation and providing predictability; hence, several commercial processors provide the ability to lock the cache. However, […]
Dec, 14

Free-form interest rate term structure decomposition: a 2nd order optimization problem

The paper discusses an interest rate term structure decomposition method that breaks from the conventional, in that it does not superimpose any model, form or structure on the decomposition output – hence, the term free-form. The premise is simple: if the model does not presuppose any structure beforehand, and if the structure underlying the input […]
Dec, 12

Behavioral Non-portability in Scientific Numeric Computing

The precise semantics of floating-point arithmetic programs depends on the execution platform, including the compiler and the target hardware. Platform dependencies are particularly pronounced for arithmetic-intensive parallel numeric programs and infringe on the highly desirable goal of software portability (which is nonetheless promised by heterogeneous computing frameworks like OpenCL): the same program run on the […]

* * *


HGPU group © 2010-2025 hgpu.org

All rights belong to the respective authors
