Posts
Dec, 6
Parallelization and Performance of the NIM Weather Model for CPU, GPU and MIC Processors
The design and performance of the NIM global weather prediction model are described. NIM was designed to run on GPU and MIC processors. It demonstrates efficient parallel performance and scalability to tens of thousands of compute nodes, and has been an effective way to make comparisons between traditional CPU and emerging fine-grain processors. Design of […]
Dec, 6
A Comparison of GPU Execution Time Prediction using Machine Learning and Analytical Modeling
Today, most high-performance computing (HPC) platforms have heterogeneous hardware resources (CPUs, GPUs, storage, etc.). A Graphics Processing Unit (GPU) is a parallel computing coprocessor specialized in accelerating vector operations. Predicting application execution times on these devices is a great challenge and is essential for efficient job scheduling. There are different approaches to do […]
Dec, 6
ESE: Efficient Speech Recognition Engine with Compressed LSTM on FPGA
Long Short-Term Memory (LSTM) is widely used in speech recognition. In order to achieve higher prediction accuracy, machine learning scientists have built larger and larger models. Such a large model is both computation-intensive and memory-intensive. Deploying such a bulky model results in high power consumption and leads to a high total cost of ownership (TCO) of […]
Dec, 6
GPU-accelerated algorithms for many-particle continuous-time quantum walks
Many-particle continuous-time quantum walks (CTQWs) represent a resource for several tasks in quantum technology, including quantum search algorithms and universal quantum computation. In order to design and implement CTQWs in a realistic scenario, one needs effective simulation tools for Hamiltonians that take into account static noise and fluctuations in the lattice, i.e. Hamiltonians containing stochastic […]
Dec, 3
OpenACC cache Directive: Opportunities and Optimizations
OpenACC’s programming model presents a simple interface to programmers, offering a trade-off between performance and development effort. OpenACC relies on compiler technologies to generate efficient code and optimize for performance. Among the directives that are difficult to implement is the cache directive. The cache directive allows the programmer to utilize the accelerator’s hardware- or software-managed caches by passing […]
Dec, 3
GeantV: from CPU to accelerators
The GeantV project aims to research and develop the next-generation simulation software describing the passage of particles through matter. While modern CPU architectures are being targeted first, resources such as GPGPUs, Intel® Xeon Phi, Atom or ARM can no longer be ignored by CPU-bound HEP applications. The proof-of-concept GeantV prototype has been mainly […]
Dec, 3
Accelerating string tokenization with FPGAs for IoT data handling equipment
This paper reports on the results of a study to accelerate string tokenization using FPGAs suitable for both IoT gateways and data center servers. The prototype developed with Xilinx High-Level Synthesis software runs at 200 MHz and processes up to 32 ASCII characters per clock cycle. It incorporates either OpenCL or our own framework (Volvox) […]
Dec, 3
Should I use TensorFlow?
Google’s Machine Learning framework TensorFlow was open-sourced in November 2015 [1] and has since built a growing community around it. TensorFlow is supposed to be flexible for research purposes while also allowing its models to be deployed productively. This work is aimed towards people with experience in Machine Learning considering whether they should use TensorFlow […]
Dec, 3
A Real-time Single Pulse Detection Algorithm for GPUs
The detection of non-repeating events in the radio spectrum has become an important area of study in radio astronomy over the last decade due to the discovery of fast radio bursts (FRBs). We have implemented a single pulse detection algorithm for NVIDIA GPUs that uses boxcar filters of varying widths. Our code performs the calculation […]
Nov, 30
Hardware thread reordering to boost OpenCL throughput on FPGAs
The availability of OpenCL for FPGAs has raised new questions about the efficiency of massive thread-level parallelism on FPGAs. The general trend is toward deep pipelining and in-order execution of many OpenCL threads across a shared data-path. While this can be a very effective approach for regular kernels, its efficiency significantly diminishes for irregular kernels […]
Nov, 30
Parallelizing Word2Vec in Multi-Core and Many-Core Architectures
Word2vec is a widely used algorithm for extracting low-dimensional vector representations of words. State-of-the-art algorithms including those by Mikolov et al. have been parallelized for multi-core CPU architectures, but are based on vector-vector operations with "Hogwild" updates that are memory-bandwidth intensive and do not efficiently use computational resources. In this paper, we propose "HogBatch" by […]
Nov, 30
Optimization of Pattern Matching Algorithms for Multi- and Many-Core Platforms
Image and video compression play a major role in the world today, allowing the storage and transmission of large volumes of multimedia content. However, processing this information requires high computational resources; hence, improving the computational performance of these compression algorithms is very important. The Multidimensional Multiscale Parser (MMP) is a pattern-matching-based compression […]