## Posts

Oct, 3

### Fast Algorithms for Convolutional Neural Networks

We derive a new class of fast algorithms for convolutional neural networks using Winograd’s minimal filtering algorithms. Specifically, we derive algorithms for network layers with 3×3 kernels, which are the preferred kernel size for image recognition tasks. The best of our algorithms reduces arithmetic complexity by up to 4× compared with direct convolution, while using small […]

Sep, 30

### Analysis of A Splitting Approach for the Parallel Solution of Linear Systems on GPU Cards

We discuss an approach for solving sparse or dense banded linear systems $\mathbf{A}\mathbf{x} = \mathbf{b}$ on a Graphics Processing Unit (GPU) card. The matrix $\mathbf{A} \in \mathbb{R}^{N \times N}$ is possibly nonsymmetric and moderately large; i.e., $10000 \leq N \leq 500000$. The *split and parallelize* (`SaP`) approach seeks […]

Sep, 29

### Performance Testing of GPU-Based Approximate Matching Algorithm on Network Traffic

Insider threat is one of the risks that both government and private organizations have to deal with in protecting their important information. Data exfiltration and data leakage resulting from insiders' activities can be very difficult to identify and quantify. Unfortunately, existing solutions that efficiently check whether data moving across a network is known to be sensitive […]

Sep, 29

### The Dynamical Kernel Scheduler – Part 1

Emerging processor architectures such as GPUs and Intel MICs provide huge performance potential for high performance computing. However, developing software for these hardware accelerators introduces additional challenges for the developer, such as exposing additional parallelism, dealing with different hardware designs, and using multiple development frameworks in order to use devices from different vendors. The […]

Sep, 29

### A Design Framework for Mapping Dataflow Graphs onto Heterogeneous Multiprocessor Platforms

Dataflow models are valuable tools for representing, analyzing, and synthesizing embedded systems. Heterogeneous computing platforms with multi-core CPU and Graphics Processing Units (GPUs) provide a low cost platform for high performance computations. In this report, we present a dataflow based automated design framework that incorporates analysis, optimization and synthesis tools for embedded systems. Our framework […]

Sep, 29

### Solving prime-field ECDLPs on GPUs with OpenCL

The intractability of the ECDLP is part of what makes many cryptographic applications work. As such, viewing this problem from as many angles as possible is worthwhile. In this thesis, we explore the angle of creating a GPU ECDLP solver using OpenCL. In the process, we discuss the many issues, limitations and solutions we encounter. […]

Sep, 29

### TABLA: A Unified Template-based Framework for Accelerating Statistical Machine Learning

A growing number of commercial and enterprise systems increasingly rely on compute-intensive machine learning algorithms. While the demand for these compute-intensive applications is growing, the performance benefits from general-purpose platforms are diminishing. To accommodate the needs of machine learning algorithms, Field Programmable Gate Arrays (FPGAs) provide a promising path forward and represent an intermediate point […]

Sep, 26

### From Pixels to Torques: Policy Learning using Deep Dynamical Convolutional Networks

Data-efficient learning in continuous state-action spaces using high-dimensional observations remains an elusive challenge in developing fully autonomous systems. An instance of this challenge is the pixels to torques problem, which identifies key elements of an autonomous agent: autonomous thinking and decision making using sensor measurements only, learning from mistakes, and applying past experiences to novel […]

Sep, 26

### Fast Exact Bayesian Inference for High-Dimensional Models

In this text, we present the principles that allow the tractable implementation of exact inference processes for a group of widespread classes of Bayesian generative models, which have until recently been deemed intractable whenever formulated using high-dimensional joint distributions. We will demonstrate the usefulness of such a principled approach with an example of real-time […]

Sep, 26

### A Survey of CUDA-based Multidimensional Scaling on GPU Architecture

The need to analyze large amounts of multivariate data raises the fundamental problem of dimensionality reduction, which is defined as the process of mapping data from a high-dimensional space into a low-dimensional one. One of the most popular methods for handling this problem is multidimensional scaling. Due to technological advances, the dimensionality of the input data as […]

Sep, 26

### A GPU accelerated Barnes-Hut Tree Code for FLASH4

We present a GPU accelerated CUDA-C implementation of the Barnes Hut (BH) tree code for calculating the gravitational potential on octree adaptive meshes. The tree code algorithm is implemented within the FLASH4 adaptive mesh refinement (AMR) code framework and therefore fully MPI parallel. We describe the algorithm and present test results that demonstrate its […]

Sep, 26

### Efficient Simulation Techniques for Large-Scale Applications

Architecture simulation is an important performance modeling approach. Modeling hardware components in sufficient detail helps architects identify both hardware and software bottlenecks. However, the major issue with architectural simulation is the huge slowdown compared to native execution. The slowdown is even greater for emerging workloads that feature high throughput and massive parallelism, such as […]