clMF: A fine-grained and portable alternating least squares algorithm for parallel matrix factorization


Jing Chen, Jianbin Fang, Weifeng Liu, Tao Tang, Canqun Yang

College of Computer, National University of Defense Technology, Changsha, China

Future Generation Computer Systems, 2018

@article{chen2018clmf,
  title={clMF: A fine-grained and portable alternating least squares algorithm for parallel matrix factorization},
  author={Chen, Jing and Fang, Jianbin and Liu, Weifeng and Tang, Tao and Yang, Canqun},
  journal={Future Generation Computer Systems},
  year={2018},
  publisher={Elsevier}
}


Alternating least squares (ALS) has proven to be an effective solver for matrix factorization in recommender systems. To speed up factorization, various parallel ALS solvers have been proposed that leverage modern multi-core and many-core processors. Existing implementations, however, are limited in either speed or portability. In this paper, we present clMF, an efficient and portable ALS solver for recommender systems. For performance, we diagnose the baseline implementation and observe that it lacks awareness of the hierarchical thread organization on modern hardware; we therefore apply thread batching, fine-grained tiling and three architecture-specific optimizations. For portability, we implement the ALS solver in OpenCL so that it can run on various platforms (CPUs, GPUs and MICs). Based on the architectural specifics, we select a suitable code variant for each platform to map it efficiently to the underlying hardware. The experimental results show that our implementation is 2.8x-15.7x faster on an Intel 16-core CPU, 23.9x-87.9x faster on an NVIDIA K20C GPU and 34.6x-97.1x faster on an AMD Fury X GPU than the baseline implementation. On the K20C GPU, our implementation also outperforms cuMF across latent feature counts ranging from 10 to 100 on various real-world recommendation datasets.
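
For readers unfamiliar with ALS, the following is a minimal serial sketch of the general technique in pure Python. It is not the clMF OpenCL code: the function names (`solve`, `update_factors`, `als`, `objective`), the plain L2 regularization term `lam`, and the toy ratings are all assumptions made for illustration. Each half-step fixes one factor matrix and solves a small regularized least-squares system per row of the other; because those per-row solves are independent, they are the natural unit of work that parallel ALS solvers distribute across threads.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def update_factors(observed_by_row, fixed, k, lam):
    """With `fixed` held constant, solve one k-by-k system per row:
    (sum_j y_j y_j^T + lam * I) x = sum_j r_j y_j over observed entries.
    Rows are independent of each other (assumed plain L2 regularization)."""
    updated = []
    for observed in observed_by_row:
        A = [[lam if p == q else 0.0 for q in range(k)] for p in range(k)]
        b = [0.0] * k
        for j, r in observed:
            y = fixed[j]
            for p in range(k):
                b[p] += r * y[p]
                for q in range(k):
                    A[p][q] += y[p] * y[q]
        updated.append(solve(A, b))
    return updated

def als(ratings, n_users, n_items, k=2, lam=0.1, iters=20, seed=0):
    """Factorize sparse (user, item, rating) triples into X (users) and Y (items)."""
    rng = random.Random(seed)
    X = [[rng.random() for _ in range(k)] for _ in range(n_users)]
    Y = [[rng.random() for _ in range(k)] for _ in range(n_items)]
    by_user = [[] for _ in range(n_users)]
    by_item = [[] for _ in range(n_items)]
    for u, i, r in ratings:
        by_user[u].append((i, r))
        by_item[i].append((u, r))
    for _ in range(iters):
        X = update_factors(by_user, Y, k, lam)  # fix Y, update users
        Y = update_factors(by_item, X, k, lam)  # fix X, update items
    return X, Y

def objective(ratings, X, Y, lam):
    """Squared error on observed ratings plus L2 penalty on both factors."""
    sse = sum((r - sum(a * b for a, b in zip(X[u], Y[i]))) ** 2
              for u, i, r in ratings)
    reg = sum(v * v for row in X + Y for v in row)
    return sse + lam * reg

if __name__ == "__main__":
    # Hypothetical toy data: 4 users, 3 items, 8 observed ratings.
    toy = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0),
           (2, 1, 2.0), (2, 2, 5.0), (3, 0, 1.0), (3, 2, 4.0)]
    X, Y = als(toy, n_users=4, n_items=3)
    print("regularized objective:", objective(toy, X, Y, 0.1))
```

Since each half-step minimizes the regularized objective exactly over one factor matrix, the objective cannot increase from one iteration to the next. The cost of a real solver is dominated by building and solving the per-row k-by-k systems, which is why latent feature count and the mapping of rows to hardware threads matter for performance.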

Tags: AMD Radeon R9 Fury X, ATI, Computer science, Factorization, Linear Algebra, nVidia, OpenCL, Package, Tesla K20

June 2, 2018 by hgpu

