Posts
Mar, 31
GPGPU-Accelerated Instruction Accurate and Fast Simulation of Thousand-core Platforms
Future architectures will feature hundreds to thousands of simple processors and on-chip memories connected through a network-on-chip. Architectural simulators will remain primary tools for design space exploration and for performance (and power) evaluation of these massively parallel architectures. However, architectural simulation performance is a serious concern, as virtual platforms and simulation technology are not able to tackle […]
Mar, 31
Adaptive Input-aware Compilation for Graphics Engines
While graphics processing units (GPUs) provide low-cost and efficient platforms for accelerating high-performance computations, the tedious process of performance tuning required to optimize applications is an obstacle to wider adoption of GPUs. In addition to the programmability challenges posed by the GPU's complex memory hierarchy and parallelism model, a well-known application design problem is target portability across […]
Mar, 30
A Highly Parallel Reuse Distance Analysis Algorithm on GPUs
Reuse distance analysis is a runtime approach that has been widely used to accurately model the memory system behavior of applications. However, traditional reuse distance analysis algorithms use tree-based data structures and are hard to parallelize, leaving untapped the tremendous computing power of modern architectures such as emerging GPUs. This paper presents a highly parallel reuse […]
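The computation being parallelized is easy to state sequentially: the reuse (stack) distance of an access is the number of distinct addresses touched since the previous access to the same address. Below is a minimal Python sketch of that classic sequential algorithm (names and the toy trace are illustrative, not the paper's code); the paper's contribution is a GPU-parallel reformulation of this same analysis.

    from collections import OrderedDict

    def reuse_distances(trace):
        """Reuse (stack) distance of each access in a memory trace.

        The distance is the number of distinct addresses touched since
        the previous access to the same address (inf on a first access).
        This list-based sequential version is quadratic in the worst case,
        which is why tree-based structures are traditionally used.
        """
        stack = OrderedDict()          # most-recently-used address last
        distances = []
        for addr in trace:
            if addr in stack:
                # Count distinct addresses used since the last touch of addr.
                keys = list(stack)
                distances.append(len(keys) - keys.index(addr) - 1)
                del stack[addr]
            else:
                distances.append(float("inf"))   # cold (first) access
            stack[addr] = None
        return distances

    print(reuse_distances(["a", "b", "c", "a", "b"]))  # [inf, inf, inf, 2, 2]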
Mar, 30
Optimized Strategies for Mapping Three-dimensional FFTs onto CUDA GPUs
We address in this paper the problem of mapping three-dimensional Fast Fourier Transforms (FFTs) onto the recent, highly multithreaded CUDA Graphics Processing Units (GPUs) and present some of the fastest known algorithms for a wide range of 3-D FFTs on the NVIDIA Tesla and Fermi architectures. We exploit the high degree of multithreading offered by the […]
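The starting point for any such mapping is that a 3-D FFT is separable into batched 1-D FFTs along each axis; the tuning effort goes into how those batches and the data reorderings between them are laid out on the GPU. A minimal NumPy illustration of the decomposition (for clarity only; this is not the authors' CUDA implementation):

    import numpy as np

    # A 3-D FFT decomposes into batched 1-D FFTs along each axis in turn.
    # GPU implementations exploit this: each pass is a large batch of
    # independent 1-D transforms, and the transpositions between passes
    # are where most of the optimization effort goes.
    x = np.random.rand(64, 64, 64) + 1j * np.random.rand(64, 64, 64)

    step = np.fft.fft(x, axis=2)       # batched 1-D FFTs along z
    step = np.fft.fft(step, axis=1)    # ... along y
    step = np.fft.fft(step, axis=0)    # ... along x

    assert np.allclose(step, np.fft.fftn(x))   # matches the direct 3-D FFT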
Mar, 30
Performance evaluation of GPU memory hierarchy using the FFT
Modern GPUs (Graphics Processing Units) are becoming more relevant in the world of HPC (High Performance Computing) thanks to their large computing power and relatively low cost; however, their special architecture makes them more complex to program. To take advantage of their computing resources and develop efficient implementations, it is essential to have some knowledge about the […]
Mar, 30
A Performance Analysis Framework for Identifying Potential Benefits in GPGPU Applications
Tuning code for GPGPU and other emerging many-core platforms is a challenge because few models or tools can precisely pinpoint the root cause of performance bottlenecks. In this paper, we present a performance analysis framework that can help shed light on such bottlenecks for GPGPU applications. Although a handful of GPGPU profiling tools exist, most […]
Mar, 29
Scheduling Tasks over Multicore machines enhanced with Accelerators: a Runtime System’s Perspective
Multicore machines equipped with accelerators are becoming increasingly popular in the High Performance Computing ecosystem. Hybrid architectures provide significantly improved energy efficiency, so they are likely to become widespread in the manycore era. However, the complexity introduced by these architectures has a direct impact on programmability, making it crucial to provide portable abstractions […]
Mar, 29
A computing origami: Optimized code generation for emerging parallel platforms
This thesis deals with code generation for parallel applications on emerging platforms, in particular FPGA and GPU-based platforms. These platforms expose a large design space, throughout which performance is affected by significant architectural idiosyncrasies. In this context, generating efficient code is a global optimization problem. The code generation methods described in this thesis apply to […]
Mar, 29
Multicore Processing for Clustering Algorithms
Data mining algorithms such as classification and clustering are the future of computation, though they require multidimensional data processing. People are using multicore processors together with GPUs. Most programming languages do not provide multiprocessing facilities, and processing resources are therefore wasted. Clustering and classification algorithms are especially resource-consuming. In this paper we show strategies […]
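What makes clustering a natural fit for multicore and GPU hardware is that its dominant step is embarrassingly parallel: assigning each point to its nearest centroid. A minimal Python sketch of that step using the standard multiprocessing module (the data, chunk count, and worker count are illustrative; this is not the paper's implementation):

    import numpy as np
    from multiprocessing import Pool

    def assign_chunk(args):
        """Label each point in a chunk with its nearest centroid (one k-means step)."""
        points, centroids = args
        d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        return np.argmin(d, axis=1)

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        points = rng.random((100_000, 8))
        centroids = rng.random((16, 8))
        chunks = np.array_split(points, 8)          # one chunk per worker
        with Pool(8) as pool:
            labels = np.concatenate(
                pool.map(assign_chunk, [(c, centroids) for c in chunks]))
        print(labels.shape)                         # (100000,)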
Mar, 29
A Massively Parallel Approach for Nonlinear Interdependency Analysis of Multivariate Signals with GPGPU
Nonlinear interdependency (NLI) analysis is an effective method for measurement of synchronization among brain regions, which is an important feature of normal and abnormal brain functions. But its application in practice has long been largely hampered by the ultra-high complexity of the NLI algorithms. We developed a massively parallel approach to address this problem. The […]
Mar, 29
Machine Learning for Predictive Auto-Tuning with Boosted Regression Trees
Auto-tuning is a widely used and effective technique for optimizing a parametrized GPU code template for a particular computation on particular hardware. Its drawback is that thorough or exhaustive auto-tuning requires compiling many kernels and calling each one many times; this process is slow. Furthermore, library abstraction boundaries provide operations such as image filtering and […]
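The predictive idea can be sketched in a few lines: train a boosted-regression-tree model on a sample of timed configurations, then let it rank the untimed ones so that only the most promising candidates need to be compiled and benchmarked. A toy illustration with scikit-learn (the features and the synthetic runtime model are made up for illustration, not taken from the paper):

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    # Hypothetical tuning records: (block_size, tile_size, unroll) -> runtime.
    rng = np.random.default_rng(0)
    configs = rng.integers(1, 33, size=(200, 3)).astype(float)
    runtimes = (configs ** [1.0, -0.5, 0.3]).prod(axis=1) + rng.normal(0, 0.1, 200)

    # Train on the configurations we have timed...
    model = GradientBoostingRegressor().fit(configs[:150], runtimes[:150])

    # ...and rank the rest by predicted runtime instead of benchmarking them all.
    candidates = configs[150:]
    best = candidates[np.argmin(model.predict(candidates))]
    print("predicted-best config:", best)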
Mar, 28
Auto-tuning a High-Level Language Targeted to GPU Codes
Determining the best set of optimizations to apply to a kernel to be executed on the graphics processing unit (GPU) is a challenging problem. There are large sets of possible optimization configurations that can be applied, and many applications have multiple kernels. Each kernel may require a specific configuration to achieve the best performance, and […]
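The size of the problem is visible from the baseline any smarter tuner must beat: an exhaustive sweep, repeated once per kernel. In the sketch below the knobs and the timing function are hypothetical stand-ins; the point is that the space grows multiplicatively with each knob added.

    import itertools, time

    # Hypothetical tuning knobs; real spaces also include unrolling,
    # prefetching, memory-layout choices, etc.
    space = {
        "block_size": [64, 128, 256],
        "tile_size":  [2, 4, 8],
        "use_smem":   [False, True],
    }

    def run_kernel(cfg):
        """Stand-in for compiling and timing one kernel variant."""
        time.sleep(0.001)
        return cfg["block_size"] / cfg["tile_size"]   # fake runtime

    best_cfg, best_time = None, float("inf")
    for values in itertools.product(*space.values()):
        cfg = dict(zip(space.keys(), values))
        t = run_kernel(cfg)
        if t < best_time:
            best_cfg, best_time = cfg, t

    print(best_cfg, best_time)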