Personalized Daily Arxiv Papers 07/16/2025

Total relevant papers: 1

Paper selection prompt and criteria at the bottom

Table of contents with paper titles:

First-Order Error Matters: Accurate Compensation for Quantized Large Language Models Authors: Xingyu Zheng, Haotong Qin, Yuye Li, Jiakai Wang, Jinyang Guo, Michele Magno, Xianglong Liu

0. First-Order Error Matters: Accurate Compensation for Quantized Large Language Models

ArXiv ID: 2507.11017 Authors: Xingyu Zheng, Haotong Qin, Yuye Li, Jiakai Wang, Jinyang Guo, Michele Magno, Xianglong Liu

Abstract: arXiv:2507.11017v1 Announce Type: new Abstract: Post-training quantization (PTQ) offers an efficient approach to compressing large language models (LLMs), significantly reducing memory access and computational costs. Existing compensation-based weight calibration methods often rely on a second-order Taylor expansion to model quantization error, under the assumption that the first-order term is negligible in well-trained full-precision models. However, we reveal that the progressive compensation process introduces accumulated first-order deviations between latent weights and their full-precision counterparts, making this assumption fundamentally flawed. To address this, we propose FOEM, a novel PTQ method that explicitly incorporates first-order gradient terms to improve quantization error compensation. FOEM approximates gradients by directly computing the difference between latent and full-precision weights, avoiding the high cost and limited generalization of backpropagation-based gradient computation. This approach introduces minimal additional computational overhead. Moreover, FOEM leverages precomputed Cholesky factors to efficiently recover the inverse of Hessian submatrices in real time. Extensive experiments across a wide range of models and benchmarks demonstrate that FOEM consistently outperforms the classical GPTQ method. In 3-bit weight-only quantization, FOEM reduces the perplexity of Llama3-8B by 89.6%, and improves the 5-shot MMLU accuracy of Llama3-70B from 51.7% to 74.9%, approaching the full-precision performance of 78.6%. Furthermore, FOEM can be seamlessly integrated with advanced techniques such as GPTAQ and SpinQuant, yielding additional improvements under the challenging W4A4KV4 setting, and further narrowing the accuracy gap with full-precision baselines beyond what current state-of-the-art methods achieve. The code is available at https://github.com/Xingyu-Zheng/FOEM.

Paper selection prompt

Quantization, Pruning, and KVCache Compression for Efficient Large Language Models
- Relevant: To optimize large language models (LLMs) for reduced memory and computational costs while maintaining performance, this research direction integrates quantization, pruning, and KVCache compression into a unified framework. In suggesting papers to your friend, remember that he enjoys papers on large language model (llm) compression, inference acceleration techniques, and related algorithms such as quantization and pruning. He is particularly interested in research that explores new methods for reducing large language model size and improving inference speed, as well as innovative approaches to optimizing large language model.