Attention Distribution-Aware Softmax for NPU-Accelerated On-Device Inference of LLMs: An Edge-Oriented Approximation Design
Academic year 114
Semester 2
Publication date 2026-03-20
Title Attention Distribution-Aware Softmax for NPU-Accelerated On-Device Inference of LLMs: An Edge-Oriented Approximation Design
Title (other language)
Authors Sadheerthan, S., M.-J. Hsu, C.-H. Huang and Y.-T. Wang
Affiliation
Publisher
Source title, volume/issue, pages Electronics, vol. 15, 1312
Abstract Low-power NPUs enable on-device LLM inference through efficient integer and fixed-point algebra, yet their lack of native exponential support makes Transformer softmax a critical performance bottleneck. Existing NPU kernels approximate e^x using uniform piecewise polynomials to enable O(1) SIMD indexing, but this wastes computation by applying high-degree arithmetic indiscriminately in every segment. Conversely, fully adaptive approaches maximize statistical fidelity but introduce pipeline stalls due to comparator-based boundary search. To bridge this gap, we propose an attention distribution-aware softmax that uses Particle Swarm Optimization (PSO) to define non-uniform segments and variable polynomial degrees, prioritizing finer granularity and lower arithmetic complexity in attention-dense regions. To ensure efficiency, we snap boundaries into a 128-bin LUT, enabling O(1) retrieval of segment parameters without branching. Inference measurements show that this design favors low-degree execution, minimizing exp-kernel overhead. Using TinyLlama-1.1B-Chat as a testbed, the proposed weighted design reduces exp-kernel cycles per call (CPC) by 18.5% versus an equidistant uniform Degree-4 baseline and 13.1% versus uniform Degree-3, while preserving ranking fidelity. These results show that grid-snapped, variable-degree approximation can improve softmax efficiency while largely preserving attention ranking fidelity, enabling accurate edge LLM inference.
Keywords Large Language Models (LLMs); softmax approximation; Multi-Objective Optimization (MOO); Edge AI; distribution-aware optimization; Neural Processing Units (NPUs) acceleration
Language en
ISSN 2079-9292
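Journal type International
Indexed in SCI
Industry-academia collaboration Domestic
Corresponding author 王銀添
Peer review 1
Country CHE
Open call for papers
Publication format Electronic
SDGs Quality Education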
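
The grid-snapped lookup described in the abstract can be illustrated with a minimal sketch. It is not the authors' kernel: the clipping range, segment boundaries, and per-segment degrees below are hypothetical placeholders (the paper obtains them with PSO), the function names are invented for illustration, and floating-point NumPy stands in for the fixed-point NPU arithmetic. It only shows how snapping non-uniform boundaries onto a 128-bin grid lets each input find its segment coefficients by one index computation, with no boundary comparisons.

```python
import numpy as np

X_MIN, X_MAX = -16.0, 0.0   # assumed clipping range for x - max(x)
N_BINS = 128                # LUT size mentioned in the abstract
MAX_DEG = 4                 # highest polynomial degree allowed in this sketch
BIN_W = (X_MAX - X_MIN) / N_BINS


def build_tables(boundaries, degrees):
    """Snap interior segment boundaries to the bin grid and fit one
    least-squares polynomial of the requested degree per segment."""
    edges = np.round((np.asarray(boundaries) - X_MIN) / BIN_W).astype(int)
    edges = np.concatenate(([0], edges, [N_BINS]))
    lut = np.zeros(N_BINS, dtype=np.int32)          # bin index -> segment id
    coeffs = np.zeros((len(degrees), MAX_DEG + 1))  # highest power first
    for s, deg in enumerate(degrees):
        lo, hi = edges[s], edges[s + 1]
        lut[lo:hi] = s
        xs = np.linspace(X_MIN + lo * BIN_W, X_MIN + hi * BIN_W, 64)
        # Lower-degree fits are left-padded with zeros so that every
        # segment shares one coefficient layout.
        coeffs[s, MAX_DEG - deg:] = np.polyfit(xs, np.exp(xs), deg)
    return lut, coeffs


def approx_softmax(scores, lut, coeffs):
    """Softmax over the last axis using the LUT-indexed piecewise polynomials."""
    x = np.clip(scores - scores.max(axis=-1, keepdims=True), X_MIN, X_MAX)
    # O(1) segment lookup: one subtract, one scale, one table read.
    idx = np.minimum(((x - X_MIN) / BIN_W).astype(np.int32), N_BINS - 1)
    c = coeffs[lut[idx]]
    y = np.zeros_like(x)
    for k in range(MAX_DEG + 1):     # Horner evaluation over padded coefficients
        y = y * x + c[..., k]
    y = np.maximum(y, 0.0)           # low-degree fits can dip slightly below zero
    return y / y.sum(axis=-1, keepdims=True)


# Hypothetical boundaries and degrees, chosen by hand for illustration only.
lut, coeffs = build_tables([-8.0, -4.0, -2.0, -1.0], [1, 2, 2, 3, 3])
probs = approx_softmax(np.array([2.0, 1.0, 0.1, -3.0]), lut, coeffs)
```

Because the coefficients are zero-padded, this sketch always runs the full Horner loop; a kernel that actually saves cycles on low-degree segments would need to dispatch on the segment's degree, which is outside the scope of this illustration.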