Quantum-Inspired Tensor Network Decomposition for Extreme Compression of Large Language Models

1. Introduction

Large language models (LLMs) have achieved transformative performance across natural language processing, reasoning, and code generation tasks. However, models such as LLaMA-2 70B, GPT-4, and PaLM-2 contain tens to hundreds of billions of parameters, requiring specialized GPU clusters for inference. The gap between model capability and deployment feasibility represents one of the central challenges in modern AI.

Existing compression approaches attack this problem from several angles. Quantization methods such as GPTQ and AWQ reduce the bit-width of individual weights from 16-bit to 4-bit or lower, achieving roughly 4x compression. Pruning methods like SparseGPT remove individual weights or structured blocks, typically achieving 50-60% sparsity before significant accuracy degradation. Knowledge distillation trains a smaller student model to mimic a larger teacher, but requires expensive retraining.

Critically, all these methods operate on individual weight matrices in isolation. They fail to exploit the rich correlational structure that exists across layers of the transformer. This cross-layer structure is precisely the kind of global correlation that tensor network methods from quantum physics were designed to capture.

In this paper, we introduce TensorLM, a framework that reshapes the full parameter set of a transformer into a high-order tensor, decomposes it into a Tree Tensor Network State (TTNS), and achieves extreme compression ratios up to 18x while preserving task performance.

2. Background

2.1 Tensor Network States

A tensor network represents a high-order tensor as a contraction of lower-order tensors. Key variants include Matrix Product States (MPS), Tree Tensor Network States (TTNS), and MERA. The bond dimension chi governs the expressiveness-compression tradeoff.

2.2 Entanglement and Compression

The entanglement entropy S across any bipartition governs compression effectiveness. Low-entanglement states require small bond dimensions, yielding exponential compression.

3. TensorLM Method

3.1 Tensorization

We reshape transformer parameters into a tensor indexed by layer, head, and dimension indices, enabling cross-layer correlation capture.

3.2 TTNS Decomposition

We use a binary tree topology grouping layers hierarchically. For chi_max=256 on LLaMA-2 7B, this yields 389M parameters (18x compression).

3.3 DMRG-Inspired Optimization

We use variational sweeping: initialize via truncated SVD, sweep over nodes optimizing locally, and adapt bond dimensions based on entanglement spectra. 5-8 sweeps on a single A100 suffice.

3.4 Efficient Inference

Matrix-vector products use the tensor network directly, costing O(chi^2 d) per layer vs O(d^2) dense.

4. Experiments

Applied to LLaMA-2 7B and 13B, evaluated on MMLU, HellaSwag, ARC-Challenge, and WikiText-2 perplexity. TensorLM at 18x compression retains 98% MMLU performance. At 9x, it nearly matches the uncompressed baseline.

4.1 Entanglement Analysis

Middle layers (12-20) show highest entanglement and resist compression most. Peripheral layers compress easily. The entanglement profile correlates with pruning sensitivity (r=0.87).

4.2 Scaling

Larger models show lower entanglement density, enabling even higher compression ratios.

5. Conclusion

TensorLM bridges quantum information theory and neural network compression, achieving state-of-the-art compression with interpretable diagnostics.