A Review of State-of-the-Art Techniques for Large Language Model Compression

Pierre V. Dantas, Lucas C. Cordeiro, Waldir S. S. Junior

Research output: Contribution to journal › Article › peer-review

Abstract

The rapid advancement of large language models (LLMs) has driven significant progress in natural language processing (NLP) and related domains. However, their deployment remains constrained by challenges related to computation, memory, and energy efficiency – particularly in real-world applications. This work presents a comprehensive review of state-of-the-art compression techniques, including pruning, quantization, knowledge distillation, and neural architecture search (NAS), which collectively aim to reduce model size, enhance inference speed, and lower energy consumption while maintaining performance. A robust evaluation framework is introduced, incorporating traditional metrics, such as accuracy and perplexity (PPL), alongside advanced criteria including latency-accuracy trade-offs, parameter efficiency, multi-objective Pareto optimization, and fairness considerations. This study further highlights trends and challenges, such as fairness-aware compression, robustness against adversarial attacks, and hardware-specific optimizations. Additionally, NAS-driven strategies are explored as a means to design task-aware, hardware-adaptive architectures that enhance LLM compression efficiency. Hybrid and adaptive methods are also examined to dynamically optimize computational efficiency across diverse deployment scenarios. This work not only synthesizes recent advancements and identifies open problems but also proposes a structured research roadmap to guide the development of efficient, scalable, and equitable LLMs. By bridging the gap between compression research and real-world deployment, this study offers actionable insights for optimizing LLMs across a range of environments, including mobile devices and large-scale cloud infrastructures.
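Of the techniques the abstract surveys, magnitude pruning and uniform post-training quantization admit compact illustrations. The sketch below is a minimal, hypothetical NumPy example (helper names and thresholds are not drawn from the reviewed paper): it zeroes the smallest-magnitude weights, then rounds the survivors onto a symmetric 8-bit integer grid.

```python
import numpy as np


def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out roughly the fraction `sparsity` of weights with smallest magnitude."""
    k = int(weights.size * sparsity)
    if k == 0:
        return weights.copy()
    # k-th smallest absolute value becomes the pruning threshold
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned


def quantize_int8(weights: np.ndarray):
    """Symmetric uniform quantization to int8; returns (quantized, scale)."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map int8 codes back to float32 for inference-time use."""
    return q.astype(np.float32) * scale
```

In practice the two steps compose: a layer's weight matrix is pruned, the remaining values are quantized, and only the int8 codes plus one scale per tensor are stored, trading a small accuracy loss for a roughly 4x reduction in memory versus float32.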
Original language: English
Journal: Complex & Intelligent Systems
Publication status: Accepted/In press - 1 Jul 2025

Keywords

  • large language model compression
  • knowledge distillation
  • quantization
  • pruning techniques
  • neural architecture search
  • resource-constrained environments
  • scalable AI systems
  • fairness in AI models
  • robustness against adversarial attacks
  • edge computing
  • adaptive compression
  • multi-objective optimization
