Posts
Dec, 15
Origami: A Convolutional Network Accelerator
Today, advanced computer vision (CV) systems of ever-increasing complexity are being deployed in a growing number of application scenarios with strong real-time and power constraints. Current trends in CV clearly show a rise of neural network-based algorithms, which have recently broken many object detection and localization records. These approaches are very flexible and can […]
Dec, 15
Adaptive algebraic multigrid on SIMD architectures
We present details of our implementation of the Wuppertal adaptive algebraic multigrid code DD-alpha AMG on SIMD architectures, with particular emphasis on the Intel Xeon Phi processor (KNC) used in QPACE 2. As a smoother, the algorithm uses a domain-decomposition-based solver code previously developed for the KNC in Regensburg. We optimized the remaining parts of […]
Dec, 15
A CUDA Kernel Scheduler Exploiting Static Data Dependencies
The CUDA execution model of Nvidia’s GPUs is based on the asynchronous execution of thread blocks, where each thread executes the same kernel in a data-parallel fashion. When threads in different thread blocks need to synchronise and communicate, the whole computation launched onto the GPU needs to be stopped and re-invoked in order to facilitate […]
Dec, 15
Run-time support for multi-level disjoint memory address spaces
High Performance Computing (HPC) systems have become widely used tools in many industry areas and research fields. Research to produce more powerful and efficient systems has grown on a par with their popularity. As a consequence, the complexity of modern HPC architectures has increased in order to provide systems with the highest levels of performance. This […]
Dec, 15
Bigger Buffer k-d Trees on Multi-Many-Core Systems
A buffer k-d tree is a k-d tree variant for massively parallel nearest neighbor search. While providing valuable speed-ups on modern many-core devices when both the number of reference points and the number of query points are large, buffer k-d trees are limited by the number of points that can fit on a single device. In this work, […]
Dec, 15
Compressed Dynamic Mode Decomposition for Real-Time Object Detection
We introduce the method of compressive dynamic mode decomposition (cDMD) for robustly performing real-time foreground/background separation in high-definition video. The DMD method provides a regression technique for least-squares fitting of video snapshots to a linear dynamical system. The method integrates two of the leading data analysis methods in use today: Fourier transforms and Principal Components. […]
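The least-squares fit mentioned in this excerpt can be sketched in a few lines. This is a minimal illustration of plain DMD-style regression only, not the compressive variant (cDMD) the paper introduces; it assumes NumPy is available, and the function name `fit_linear_operator`, the matrix `A_true`, and the toy two-dimensional "snapshots" are hypothetical stand-ins for real video frames.

```python
import numpy as np

def fit_linear_operator(snapshots):
    """Least-squares fit of A in x[k+1] ~ A @ x[k].

    `snapshots` holds one state per column, ordered in time.
    """
    X1 = snapshots[:, :-1]   # states at steps 0 .. m-2
    X2 = snapshots[:, 1:]    # states at steps 1 .. m-1
    # The pseudoinverse gives the minimum-norm least-squares solution.
    return X2 @ np.linalg.pinv(X1)

# Toy data (assumption): snapshots generated by a known linear system.
A_true = np.array([[0.9, 0.1],
                   [0.0, 0.8]])
columns = [np.array([1.0, 1.0])]
for _ in range(5):
    columns.append(A_true @ columns[-1])
snapshots = np.column_stack(columns)

A_fit = fit_linear_operator(snapshots)
```

On this noise-free toy data the fitted operator matches `A_true` up to rounding. For real video, each column would be a flattened frame, and forming the full operator explicitly would be impractical, which is one motivation for reduced or compressed formulations.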
Dec, 15
A Survey Of Techniques for Cache Locking
Cache memory, although important for boosting application performance, is also a source of execution time variability, and this makes its use difficult in systems requiring worst-case execution time (WCET) guarantees. Cache locking is a promising approach for simplifying WCET estimation and providing predictability; hence, several commercial processors provide the ability to lock the cache. However, […]
Dec, 14
Free-form interest rate term structure decomposition: a 2nd order optimization problem
The paper discusses an interest rate term structure decomposition method that breaks from the conventional, in that it does not superimpose any model, form or structure on the decomposition output – hence, the term free-form. The premise is simple: if the model does not presuppose any structure beforehand, and if the structure underlying the input […]
Dec, 12
A Scalable Lane Detection Algorithm on COTSs with OpenCL
Road lane detection is a classical requirement for advanced driver assistance systems. With new computer technologies, lane detection algorithms can be deployed on COTS platforms. This paper investigates the use of OpenCL and develops a particle-filter-based lane detection algorithm that can tune the trade-off between detection accuracy and speed. Our algorithm is tested on 14 […]
Dec, 12
Behavioral Non-portability in Scientific Numeric Computing
The precise semantics of floating-point arithmetic programs depends on the execution platform, including the compiler and the target hardware. Platform dependencies are particularly pronounced for arithmetic-intensive parallel numeric programs and infringe on the highly desirable goal of software portability (which is nonetheless promised by heterogeneous computing frameworks like OpenCL): the same program run on the […]
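One well-known source of such platform dependence is that floating-point addition is not associative, so a compiler or parallel runtime that reorders a reduction can change the result. The sketch below illustrates only this single effect on one machine; it is not taken from the paper, and the helper `sum_in_order` and the input values are chosen purely for illustration.

```python
import math

def sum_in_order(xs):
    """Accumulate in exactly the given order, as a sequential loop would."""
    total = 0.0
    for x in xs:
        total += x
    return total

values = [1e17, 1.0, -1e17]

# Sequential order: the 1.0 is absorbed by 1e17 and lost.
left_to_right = sum_in_order(values)

# Same numbers, different order (as a parallel reduction might choose):
# the large terms cancel first, so the 1.0 survives.
reordered = sum_in_order([values[0], values[2], values[1]])

# Correctly rounded reference sum from the standard library.
exact = math.fsum(values)

print(left_to_right, reordered, exact)
```

With IEEE 754 double precision, the first ordering yields 0.0 while the second yields 1.0, which agrees with `math.fsum`. Both orderings are valid evaluations of the same mathematical sum, which is why identical source code can behave differently across compilers and devices.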
Dec, 12
Large-Scale Compute-Intensive Analysis via a Combined In-Situ and Co-Scheduling Workflow Approach
Large-scale simulations can produce hundreds of terabytes to petabytes of data, complicating and limiting the efficiency of work-flows. Traditionally, outputs are stored on the file system and analyzed in post-processing. With the rapidly increasing size and complexity of simulations, this approach faces an uncertain future. Trending techniques consist of performing the analysis in-situ, utilizing the […]
Dec, 12
Accelerating Exact Similarity Search on CPU-GPU Systems
In recent years, the use of Graphics Processing Units (GPUs) for data mining tasks has become popular. With modern processors integrating both CPUs and GPUs, it is also important to consider what tasks benefit from GPU processing and which do not, and apply a heterogeneous processing approach to improve the efficiency where applicable. Similarity search, […]