high performance computing on graphics processing units: hgpu.org

Profiling Apple Silicon Performance for ML Training

Dahua Feng, Zhiming Xu, Rongxiang Wang, Felix Xiaozhu Lin

University of Virginia

arXiv:2501.14925 [cs.PF], (28 Jan 2025)

DOI:10.48550/arXiv.2501.14925

@misc{feng2025profilingapplesiliconperformance,
  title={Profiling Apple Silicon Performance for ML Training},
  author={Dahua Feng and Zhiming Xu and Rongxiang Wang and Felix Xiaozhu Lin},
  year={2025},
  eprint={2501.14925},
  archivePrefix={arXiv},
  primaryClass={cs.PF},
  url={https://arxiv.org/abs/2501.14925}
}

Apple Silicon has attracted much attention for its performance and its role in machine learning (ML) training. Unlike NVIDIA GPUs, which have traditionally dominated ML training, Apple Silicon differs significantly in memory architecture: it uses Unified Memory, which integrates CPU and GPU memory instead of keeping separate CPU memory and GPU VRAM. However, it is unclear whether Unified Memory translates into performance benefits. This paper investigates the question by training several large language model (LLM) workloads end-to-end under different memory scenarios. The results show a significant performance gap between Apple Silicon and NVIDIA GPUs, which the paper attributes to system-level factors such as page faults, power consumption, and kernel launch time. The paper also analyzes the performance of basic linear algebra subprograms (BLAS) on NVIDIA GPUs and Apple Silicon chips to further explain the observed gap.
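
To make the BLAS comparison mentioned in the abstract more concrete, the following is a minimal sketch of how GEMM throughput is commonly measured: count the floating-point operations of a matrix multiply (2·M·N·K) and divide by the best observed wall-clock time. This is not the paper's benchmark code. The function names (`gemm_flops`, `matmul`, `measure_gflops`) are hypothetical, and the pure-Python `matmul` is only a stand-in for a real BLAS call so that the example runs without any GPU framework.

```python
import time


def gemm_flops(m, n, k):
    """Floating-point operations for C = A @ B, with A (m x k) and B (k x n)."""
    return 2 * m * n * k


def matmul(a, b):
    """Naive pure-Python matrix multiply (illustrative stand-in for a BLAS GEMM)."""
    b_cols = list(zip(*b))  # transpose B so each column can be iterated directly
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def measure_gflops(fn, flops, warmup=1, repeats=3):
    """Run fn several times and return throughput in GFLOP/s from the best run."""
    for _ in range(warmup):
        fn()  # warm-up runs are excluded from the measurement
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return flops / best / 1e9


if __name__ == "__main__":
    n = 64
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    rate = measure_gflops(lambda: matmul(a, b), gemm_flops(n, n, n))
    print(f"{n}x{n} GEMM: {rate:.4f} GFLOP/s")
```

When the timed function launches GPU work instead, the device generally has to be synchronized before the timer is read, because kernel launches are asynchronous; otherwise the measurement mostly reflects launch overhead rather than compute time.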

Tags: AI, Apple M2 Max, Apple M2 Pro, Apple M2 Ultra, Computer science, CUDA, Linear Algebra, LLM, Machine learning, nVidia, nVidia GeForce RTX 4090, nVidia GeForce RTX 2080 Ti, nVidia Quadro RTX 4000, nVidia RTX A6000, Performance, PyTorch

February 3, 2025 by hgpu
