high performance computing on graphics processing units: hgpu.org


Fast Parallel Machine Learning Algorithms for Large Datasets Using Graphic Processing Unit

Qi Li

Virginia Commonwealth University, Richmond, Virginia

Virginia Commonwealth University, 2011


This dissertation develops parallel processing algorithms for the Graphics Processing Unit (GPU) to solve machine learning problems on large datasets. In particular, it contributes fast GPU-based algorithms for calculating the distance (i.e., similarity, affinity, closeness) matrix. It also presents the algorithm and implementation of a fast parallel Support Vector Machine (SVM) on the GPU. These tools are developed using the Compute Unified Device Architecture (CUDA), a popular software framework for general-purpose computing on GPUs (GPGPU). Distance calculation is a core part of many machine learning algorithms because the closer a query is to some samples (i.e., observations, records, entries), the more likely the query belongs to the class of those samples. K-Nearest Neighbors Search (k-NNS) is a popular and powerful distance-based tool for solving classification problems. It is also a prerequisite for training local-model-based classifiers. Fast distance calculation can significantly improve the speed of these classifiers, and GPUs are well suited to accelerating them. Several GPU-based sorting algorithms are also included to sort the distance matrix and find the k nearest neighbors. The speed of these sorting algorithms varies with the input sequence. The GPUKNN proposed in this dissertation uses the GPU-based distance computation algorithm and automatically picks the most suitable sorting algorithm according to the characteristics of the input dataset. Every machine learning tool has its own pros and cons. The advantage of SVM is its high classification accuracy, which makes it possibly the best classification tool. However, as with many other machine learning algorithms, SVM's training phase slows down as the size of the input dataset increases.
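The distance-matrix and k-nearest-neighbor steps described above can be illustrated with a small CPU-side sketch. It is not the dissertation's GPUKNN code: the abstract does not name a distance metric, so squared Euclidean distance is assumed here, and the function names `squared_distance_matrix`, `k_nearest`, and `knn_classify` are hypothetical. The expansion ||q - s||^2 = ||q||^2 + ||s||^2 - 2 q·s is used because the cross term is a matrix product, which is the kind of operation that maps naturally onto a GPU.

```python
import heapq
from collections import Counter


def squared_distance_matrix(queries, samples):
    """Return D where D[i][j] is the squared Euclidean distance
    between queries[i] and samples[j] (metric is an assumption)."""
    # Precompute squared norms once per point.
    q_norms = [sum(x * x for x in q) for q in queries]
    s_norms = [sum(x * x for x in s) for s in samples]
    # ||q - s||^2 = ||q||^2 + ||s||^2 - 2 q.s, clamped at 0 to absorb
    # small negative values caused by floating-point rounding.
    return [
        [max(qn + sn - 2.0 * sum(a * b for a, b in zip(q, s)), 0.0)
         for s, sn in zip(samples, s_norms)]
        for q, qn in zip(queries, q_norms)
    ]


def k_nearest(dist_row, k):
    """Indices of the k smallest distances in one row, nearest first.
    A partial selection stands in for a full sort of the row."""
    return heapq.nsmallest(k, range(len(dist_row)), key=dist_row.__getitem__)


def knn_classify(queries, samples, labels, k):
    """Label each query by majority vote among its k nearest samples."""
    dist = squared_distance_matrix(queries, samples)
    predictions = []
    for row in dist:
        votes = Counter(labels[j] for j in k_nearest(row, k))
        predictions.append(votes.most_common(1)[0][0])
    return predictions


if __name__ == "__main__":
    samples = [[0, 0], [0, 1], [5, 5], [6, 5]]
    labels = ["a", "a", "b", "b"]
    print(knn_classify([[0, 2], [5, 4]], samples, labels, k=3))
```

Each row of the matrix is independent of the others, so rows (or tiles of rows) can be computed and selected in parallel. The sketch uses a heap-based partial selection where the dissertation chooses among several GPU sorting algorithms.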
The GPU version of parallel SVM implemented in this dissertation, GPUSVM, is based on parallel Sequential Minimal Optimization (SMO) and is proposed to reduce the time cost of both the training and prediction phases. This implementation is original. It uses many parallel processing techniques to accelerate and minimize kernel evaluation, which is considered the most time-consuming operation in SVM. Although the many-core architecture of the GPU performs best with data-level parallelism, multi-task processing (i.e., task-level parallelism) is also integrated into the application to speed up tasks such as multiclass classification and cross-validation. Furthermore, the procedure of finding the worst violators is distributed over multiple blocks in the CUDA model, which reduces the time cost of each SMO iteration during the training phase. All of these violators are shared among the different tasks in multiclass classification and cross-validation to reduce duplicate kernel computations. The results show that the training and prediction phases run one to three orders of magnitude faster than the state-of-the-art LIBSVM software on several well-known benchmark datasets.
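The idea of distributing the worst-violator search over multiple blocks can be sketched as a two-stage reduction. The abstract does not state the selection rule, so this sketch assumes the common first-order "maximal violating pair" heuristic: with optimality indicators f_i, pick the smallest f_i over the index set I_up and the largest f_i over I_low. The names `block_candidates` and `worst_violators`, and the sequential loop standing in for CUDA blocks, are illustrative assumptions rather than the GPUSVM implementation.

```python
def in_up_set(y, alpha, C):
    """Index belongs to I_up: alpha can still move in the 'up' direction."""
    return (y > 0 and alpha < C) or (y < 0 and alpha > 0)


def in_low_set(y, alpha, C):
    """Index belongs to I_low: alpha can still move in the 'low' direction."""
    return (y > 0 and alpha > 0) or (y < 0 and alpha < C)


def block_candidates(f, y, alpha, C, start, stop):
    """Stage 1: one block scans its own slice [start, stop) and returns
    its local (min f over I_up, max f over I_low), each paired with an index."""
    best_up = (float("inf"), -1)
    best_low = (float("-inf"), -1)
    for i in range(start, stop):
        if in_up_set(y[i], alpha[i], C) and f[i] < best_up[0]:
            best_up = (f[i], i)
        if in_low_set(y[i], alpha[i], C) and f[i] > best_low[0]:
            best_low = (f[i], i)
    return best_up, best_low


def worst_violators(f, y, alpha, C, block_size=4):
    """Stage 2: combine the per-block candidates into the global pair.
    On a GPU the stage-1 calls would run concurrently, one per block."""
    n = len(f)
    partial = [
        block_candidates(f, y, alpha, C, s, min(s + block_size, n))
        for s in range(0, n, block_size)
    ]
    b_up, i_up = min(p[0] for p in partial)
    b_low, i_low = max(p[1] for p in partial)
    return i_up, i_low, b_up, b_low


if __name__ == "__main__":
    f = [0.5, -1.2, 0.3, 2.0, -0.7, 1.1]
    y = [1, -1, 1, -1, 1, -1]
    alpha = [0.0, 0.0, 0.5, 1.0, 1.0, 0.3]
    print(worst_violators(f, y, alpha, C=1.0, block_size=2))
```

Under this heuristic, training stops once b_low no longer exceeds b_up by more than a chosen tolerance. The result does not depend on the block size, which is what allows the scan to be split across blocks freely.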

Tags: Algorithms, Benchmarking, Computer science, CUDA, Machine learning, Nearest neighbour, nVidia, Optimization, Sorting, Tesla C1060, Tesla C2050, Tesla C2070, Thesis

December 26, 2011 by hgpu
