Model compression


Model compression refers to techniques in machine learning for reducing the size of trained models. Large models can achieve high accuracy, but often at the cost of significant resource requirements. Compression techniques aim to shrink models without significantly sacrificing performance. Smaller models require less storage space and consume less memory and compute during inference.

Compressed models enable deployment on resource-constrained devices such as smartphones and embedded systems, supporting on-device AI, edge computing, and consumer electronics. Efficient inference is also valuable for large corporations that serve large-model inference over an API, allowing them to reduce computational costs and improve response times for users.

Model compression is distinct from model distillation, which trains a separate, smaller model to imitate the input-output behavior of the larger model.

Techniques

Several techniques are employed for model compression.

Pruning

Pruning sparsifies a large model by setting some parameters to exactly zero, effectively reducing the number of parameters. The resulting sparse weight matrices can be processed with sparse matrix operations, which can be faster than dense matrix operations.

Pruning criteria can be based on magnitudes of parameters, the statistical pattern of neural activations, Hessian values, etc.[1][2]
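As an illustration, the following sketch applies magnitude-based unstructured pruning to a single linear layer using PyTorch's torch.nn.utils.prune utilities; the layer size and the 50% pruning ratio are arbitrary choices for the example, not values prescribed by any particular method.

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# A toy layer standing in for part of a trained model.
layer = nn.Linear(256, 256)

# Zero out the 50% of weights with the smallest absolute value
# (magnitude-based unstructured pruning).
prune.l1_unstructured(layer, name="weight", amount=0.5)

# Make the pruning permanent by removing the re-parametrization,
# leaving a weight tensor in which half the entries are exactly zero.
prune.remove(layer, "weight")

sparsity = (layer.weight == 0).float().mean().item()
print(f"fraction of zero weights: {sparsity:.2f}")

In practice the pruned model is usually fine-tuned afterwards to recover any lost accuracy.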

Quantization

Quantization reduces the numerical precision of weights and activations. For example, instead of storing weights as 32-bit floating-point numbers, they can be represented using 8-bit integers. Low-precision parameters take up less space, and arithmetic on them requires less compute.

It is also possible to quantize some parameters more aggressively than others: for example, a less important parameter can be stored at 8-bit precision while a more important one is kept at 16-bit precision. Inference with such models requires mixed-precision arithmetic.[3][4]
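The arithmetic behind the simplest case, symmetric per-tensor 8-bit quantization, can be sketched as follows; the helper names and the stand-in weight matrix are illustrative only and do not correspond to the scheme of any specific framework.

import torch

def quantize_int8(w: torch.Tensor):
    """Symmetric per-tensor quantization of a float tensor to int8."""
    scale = w.abs().max() / 127.0          # map the largest magnitude to 127
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate float tensor from the int8 representation."""
    return q.float() * scale

w = torch.randn(512, 512)                  # stand-in weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("max absolute quantization error:", (w - w_hat).abs().max().item())

Storing q (one byte per entry) plus a single scale factor takes roughly a quarter of the space of the original 32-bit tensor.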

Quantization can also be applied during training rather than only after it. PyTorch implements automatic mixed precision (AMP), which performs autocasting (running selected operations in lower precision) together with gradient (loss) scaling to prevent underflow.[5][6]
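A typical PyTorch AMP training step looks roughly like the following sketch; the model, optimizer, loss function, and synthetic batches are placeholders, and only the autocast/GradScaler calls illustrate the mechanism.

import torch

# Placeholder model, optimizer, loss, and synthetic CUDA batches.
model = torch.nn.Linear(1024, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = torch.nn.CrossEntropyLoss()
loader = [(torch.randn(32, 1024, device="cuda"),
           torch.randint(0, 10, (32,), device="cuda")) for _ in range(10)]

scaler = torch.cuda.amp.GradScaler()       # scales the loss to avoid FP16 gradient underflow

for inputs, targets in loader:
    optimizer.zero_grad()
    # Autocasting runs eligible ops in float16 while keeping others in float32.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()           # backward pass on the scaled loss
    scaler.step(optimizer)                  # unscales gradients, then steps the optimizer
    scaler.update()                         # adjusts the scale factor for the next step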

Low-rank factorization

Weight matrices can be approximated by low-rank matrices. Let $W$ be a weight matrix of shape $m \times n$. A low-rank approximation is $W \approx U V^T$, where $U$ and $V$ are matrices of shapes $m \times k$ and $n \times k$ respectively. When $k$ is small, this both reduces the number of parameters needed to represent $W$ approximately, and accelerates multiplication by $W$.

Low-rank approximations can be found by singular value decomposition (SVD). The choice of rank for each weight matrix is a hyperparameter, and can be optimized jointly with the weights as a mixed discrete-continuous optimization problem.[7]
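A truncated SVD gives such an approximation directly, as in the sketch below; the matrix shapes and the rank k = 32 are arbitrary illustrative values.

import torch

def low_rank_approx(W: torch.Tensor, k: int):
    """Rank-k approximation W ≈ U V^T obtained by truncated SVD."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_k = U[:, :k] * S[:k]                 # fold the singular values into U; shape (m, k)
    V_k = Vh[:k, :].T                      # shape (n, k)
    return U_k, V_k

W = torch.randn(1024, 512)                 # stand-in weight matrix of shape m x n
U_k, V_k = low_rank_approx(W, k=32)

# Storage drops from m*n to k*(m+n) parameters, and W @ x can be computed
# as U_k @ (V_k.T @ x), which is cheaper when k is small.
x = torch.randn(512)
error = (W @ x - U_k @ (V_k.T @ x)).norm() / (W @ x).norm()
print("relative error of the rank-32 approximation:", error.item())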

Training

Model compression is usually decoupled from training, that is, a model is first trained without regard for how it might be compressed, then it is compressed. However, it is possible to combine model compression with training.

The "train big, then compress" method trains a large model for a small number of training steps (less than it would be if it were trained to convergence), then heavily compress the model. It is found that at the same compute budget, this method results in a better model than lightly compressed, small models.[8]

In Deep Compression,[9] compression proceeds in three steps (a simplified sketch of the weight-sharing step is shown after the list).

  • First loop: prune all weights below a threshold, fine-tune the network, then prune again, and so on.
  • Second loop: cluster the weights, enforce weight sharing among all weights in each cluster, fine-tune the network, then cluster again, and so on.
  • Third step: use Huffman coding to losslessly compress the model.
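The following is a much-simplified sketch of the weight-sharing (clustering) step only, ignoring the pruning and fine-tuning loops of the original method; the cluster count, the crude pruning threshold, and the stand-in weight matrix are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans

def cluster_weights(w: np.ndarray, n_clusters: int = 16):
    """Simplified weight sharing: replace each nonzero weight by the centroid
    of its k-means cluster, so only the centroids and per-weight cluster
    indices need to be stored. Pruned, exactly-zero weights are left untouched."""
    nonzero = w != 0
    values = w[nonzero].reshape(-1, 1)
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(values)
    shared = w.copy()
    shared[nonzero] = km.cluster_centers_[km.labels_].ravel()
    return shared, km.cluster_centers_

w = np.random.randn(64, 64)
w[np.abs(w) < 0.5] = 0.0                   # crude stand-in for the magnitude-pruning loop
w_shared, centroids = cluster_weights(w, n_clusters=16)
print("distinct nonzero values after sharing:", np.unique(w_shared[w_shared != 0]).size)

With only 16 distinct nonzero values, each weight can be stored as a 4-bit index into the centroid table, which is what makes the final Huffman-coding step effective.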

References

  1. ^ Reed, R. (September 1993). "Pruning algorithms-a survey". IEEE Transactions on Neural Networks. 4 (5): 740–747. doi:10.1109/72.248452.
  2. ^ Blalock, Davis; Gonzalez Ortiz, Jose Javier; Frankle, Jonathan; Guttag, John (2020-03-15). "What is the State of Neural Network Pruning?". Proceedings of Machine Learning and Systems. 2: 129–146.
  3. ^ Abdelfattah, Ahmad; Anzt, Hartwig; Boman, Erik G.; Carson, Erin; Cojean, Terry; Dongarra, Jack; Gates, Mark; Grützmacher, Thomas; Higham, Nicholas J.; Li, Sherry; Lindquist, Neil; Liu, Yang; Loe, Jennifer; Luszczek, Piotr; Nayak, Pratik; Pranesh, Sri; Rajamanickam, Siva; Ribizel, Tobias; Smith, Barry; Swirydowicz, Kasia; Thomas, Stephen; Tomov, Stanimire; Tsai, Yaohung M.; Yamazaki, Ichitaro; Yang, Ulrike Meier (2020). "A Survey of Numerical Methods Utilizing Mixed Precision Arithmetic". arXiv:2007.06674 [cs.MS].
  4. ^ Micikevicius, Paulius; Narang, Sharan; Alben, Jonah; Diamos, Gregory; Elsen, Erich; Garcia, David; Ginsburg, Boris; Houston, Michael; Kuchaiev, Oleksii (2018-02-15). "Mixed Precision Training". arXiv:1710.03740 [cs.AI].
  5. ^ "Mixed Precision — PyTorch Training Performance Guide". residentmario.github.io. Retrieved 2024-09-10.
  6. ^ "What Every User Should Know About Mixed Precision Training in PyTorch". PyTorch. Retrieved 2024-09-10.
  7. ^ Idelbayev, Yerlan; Carreira-Perpinan, Miguel A. (2020). "Low-Rank Compression of Neural Nets: Learning the Rank of Each Layer". Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR): 8049–8059.
  8. ^ Li, Zhuohan; Wallace, Eric; Shen, Sheng; Lin, Kevin; Keutzer, Kurt; Klein, Dan; Gonzalez, Joey (2020-11-21). "Train Big, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers". Proceedings of the 37th International Conference on Machine Learning. PMLR: 5958–5968.
  9. ^ Han, Song; Mao, Huizi; Dally, William J. (2016-02-15). "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding". arXiv:1510.00149. doi:10.48550/arXiv.1510.00149. Retrieved 2024-10-18.