
Posts

Dec, 10

Join Execution Using Fragmented Columnar Indices on GPU and MIC

The paper describes an approach to the parallel execution of natural joins on computing clusters with GPU and MIC coprocessors. This approach is based on a decomposition of the natural join relational operator using column indices and domain-interval fragmentation. This decomposition allows the resource-intensive relational operators to be executed in parallel without data transfers. All column index fragments are […]
Dec, 9

A Parallel Solver for Markov Decision Process in Crowd Simulations

Classic path-finding algorithms are not adequate for real-world path planning, where environment information is incomplete or dynamic; Markov Decision Processes (MDPs) have been used as an alternative. The problem with the MDP formalism is that its state space grows exponentially with the number of domain variables, and its inference methods grow with the […]
Dec, 8

A Semi-Automated Tool Flow for Roofline Analysis of OpenCL Kernels on Accelerators

We propose a tool-flow methodology that can be applied to analyze and track the performance of OpenCL applications on heterogeneous platforms. Using a case study on a datacenter representative workload, we evaluate our tool flow on three distinct heterogeneous platforms and demonstrate how it can be employed more widely to provide insight and track attainable […]
Dec, 8

Towards Memory-Efficient Answering of Tree-Shaped SPARQL Queries using GPUs

We present an idea for efficient query answering over an RDF dataset that employs a consumer-grade graphics card for the computation. We consider tree-shaped SPARQL queries and static datasets, to facilitate data mining over RDF graphs in warehouse-like setups. Reasons to see the poster: a) presentation of the approach with examples; b) possibility of discussion […]
Dec, 8

Scaling Deep Learning on Multiple In-Memory Processors

Deep learning methods are proven to be state-of-the-art in addressing many challenges in machine learning domains. However, it comes at the cost of high computational requirements and energy consumption. The emergence of Processing In Memory (PIM) with die-stacking technology presents an opportunity to speed up deep learning computation and reduce energy consumption by providing low-cost […]
Dec, 8

Nonlinear Dynamic Analysis Efficiency by Using a GPU Parallelization

A graphics processing unit (GPU) parallelization approach was implemented to improve the efficiency of nonlinear dynamic analysis. The GPU parallelization approach sped up the computation of implicit time integration and reduced total calculation time. In addition, a parallel equation solver was introduced to solve the equation system. Numerical examples of reinforced concrete (RC) frames were […]
Dec, 8

MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems

MXNet is a multi-language machine learning (ML) library to ease the development of ML algorithms, especially for deep neural networks. Embedded in the host language, it blends declarative symbolic expression with imperative tensor computation. It offers auto differentiation to derive gradients. MXNet is computation and memory efficient and runs on various heterogeneous systems, ranging from […]
Dec, 6

A Survey Of Architectural Techniques for Managing Process Variation

Process variation (deviation in parameters from their nominal specifications) threatens to slow down and even pause technological scaling, and mitigating it is the way to continue the benefits of chip miniaturization. In this paper, we present a survey of architectural techniques for managing process variation (PV) in modern processors. We also classify these techniques […]
Dec, 6

Using Data Compression for Increasing Efficiency of Data Transfer Between Main Memory and Intel Xeon Phi Coprocessor or NVidia GPU in Parallel DBMS

The need to transfer data through the PCI Express bus is considered one of the main bottlenecks in programming for manycore coprocessors and GPUs. This paper focuses on using data compression methods, such as RLE, Null Suppression, LZSS, and a combination of RLE and Null Suppression, to increase the efficiency of data transfer between main memory and the coprocessor. […]
Dec, 6

A Study of Parallel Sorting Algorithms Using CUDA and OpenMP

This thesis reviews the parallel languages according to their computational complexities, in terms of time, while using sorting algorithms coded in CUDA and OpenMP. The thesis evaluates the solution for parallelism at a maintainable cost of money and other efforts, for achieving acceptable results of timing when compared to parallel languages together, as well as […]
Dec, 6

Parallel Implementation of Vortex Element Method on CPUs and GPUs

Implementations of the 2D vortex element method adapted to different types of parallel computers are considered. The developed MPI implementation provides close to linear acceleration for a small number of computational cores and approximately 40-fold acceleration on an 80-core cluster when solving a model problem. An OpenMP-based modification yields an additional 5% acceleration due to shared memory usage. Approximate […]
Dec, 6

CuMF: scale matrix factorization using just ONE machine with GPUs

Matrix factorization (MF) is widely used in recommendation systems. We present cuMF, a highly optimized matrix factorization tool with supreme performance on graphics processing units (GPUs), achieved by fully utilizing the GPU compute power and minimizing the overhead of data movement. First, we introduce a memory-optimized alternating least squares (ALS) method by reducing discontiguous memory access and […]


HGPU group © 2010-2025 hgpu.org

All rights belong to the respective authors
