
GPU-based Private Information Retrieval for On-Device Machine Learning Inference

Maximilian Lam, Jeff Johnson, Wenjie Xiong, Kiwan Maeng, Udit Gupta, Minsoo Rhu, Hsien-Hsin S. Lee, Vijay Janapa Reddi, Gu-Yeon Wei, David Brooks, Edward Suh
Meta AI
arXiv:2301.10904 [cs.CR], 26 Jan 2023

@misc{https://doi.org/10.48550/arxiv.2301.10904,
   doi={10.48550/ARXIV.2301.10904},
   url={https://arxiv.org/abs/2301.10904},
   author={Lam, Maximilian and Johnson, Jeff and Xiong, Wenjie and Maeng, Kiwan and Gupta, Udit and Rhu, Minsoo and Lee, Hsien-Hsin S. and Reddi, Vijay Janapa and Wei, Gu-Yeon and Brooks, David and Suh, Edward},
   keywords={Cryptography and Security (cs.CR), Distributed, Parallel, and Cluster Computing (cs.DC), Machine Learning (cs.LG), FOS: Computer and information sciences},
   title={GPU-based Private Information Retrieval for On-Device Machine Learning Inference},
   publisher={arXiv},
   year={2023},
   copyright={Creative Commons Attribution 4.0 International}
}


On-device machine learning (ML) inference can enable the use of private user data on user devices without remote servers. However, a pure on-device solution to private ML inference is impractical for many applications that rely on embedding tables that are too large to be stored on-device. To overcome this barrier, we propose the use of private information retrieval (PIR) to efficiently and privately retrieve embeddings from servers without sharing any private information during on-device ML inference. As off-the-shelf PIR algorithms are usually too computationally intensive to directly use for latency-sensitive inference tasks, we 1) develop a novel algorithm for accelerating PIR on GPUs, and 2) co-design PIR with the downstream ML application to obtain further speedup. Our GPU acceleration strategy improves system throughput by more than 20x over an optimized CPU PIR implementation, and our co-design techniques obtain over 5x additional throughput improvement at fixed model quality. Together, on various on-device ML applications such as recommendation and language modeling, our system on a single V100 GPU can serve up to 100,000 queries per second — a >100x throughput improvement over a naively implemented system — while maintaining model accuracy, and limiting inference communication and response latency to within 300KB and <100ms respectively.
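The core retrieval step can be illustrated with a toy two-server PIR based on additive secret sharing: the client splits a one-hot row selector into two random-looking shares, each server multiplies its share against the full embedding table, and the client sums the two answers to recover only its row, while neither server learns which row was requested. The sketch below (Python/NumPy, with a hypothetical table shape and a deliberately small modulus to avoid integer overflow) only illustrates the general idea; it is not the paper's scheme, which uses more communication-efficient PIR primitives and GPU-specific kernels.

```python
import numpy as np

# Toy two-server PIR for one embedding-table row via additive secret sharing.
# All names and sizes here are illustrative, not the paper's actual scheme.
P = 65537                  # small prime modulus so int64 dot products cannot overflow
N_ROWS, DIM = 1024, 64     # assumed embedding-table shape for this example

rng = np.random.default_rng(seed=0)
# Quantized embedding table held identically by two non-colluding servers.
table = rng.integers(0, P, size=(N_ROWS, DIM), dtype=np.int64)

def client_query(index):
    """Split the one-hot selector for `index` into two shares; each share alone is uniformly random."""
    share_a = rng.integers(0, P, size=N_ROWS, dtype=np.int64)
    one_hot = np.zeros(N_ROWS, dtype=np.int64)
    one_hot[index] = 1
    share_b = (one_hot - share_a) % P
    return share_a, share_b

def server_answer(share):
    """Each server multiplies its share by the whole table; it learns nothing about the index."""
    return (share @ table) % P

def client_reconstruct(answer_a, answer_b):
    """Summing the two answers cancels the shared randomness, yielding exactly table[index]."""
    return (answer_a + answer_b) % P

share_a, share_b = client_query(42)              # the client keeps `42` private
row = client_reconstruct(server_answer(share_a), server_answer(share_b))
assert np.array_equal(row, table[42])
```

Two costs are visible even in this toy: the query is as long as the table, and each server does a full pass over the table per request. Practical PIR schemes compress the query far below this, and the paper's contribution is accelerating such schemes on GPUs and co-designing them with the downstream ML application so that communication and latency stay within the limits quoted in the abstract.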