
CAT: compression-aware training for bandwidth reduction. (English) Zbl 07626784

Summary: One major obstacle hindering the ubiquitous use of CNNs for inference is their relatively high memory bandwidth requirement, which can be the primary energy consumer and throughput bottleneck in hardware accelerators. Inspired by quantization-aware training approaches, we propose a compression-aware training (CAT) method in which the model is trained so that its weights and feature maps compress well at deployment time. Specifically, the model is trained to produce low-entropy feature maps, enabling efficient compression at inference time with classical transform coding methods. CAT significantly improves the state-of-the-art quantization results on various vision and NLP tasks, such as image classification (ImageNet), object detection (Pascal VOC), linguistic acceptability (CoLA), and textual entailment (MNLI). For example, on ResNet-18 we achieve near-baseline ImageNet accuracy with an average representation of only 1.5 bits per value under 5-bit quantization. Moreover, we show that entropy reduction of weights and activations can be applied together, further reducing bandwidth. A reference implementation is available.
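The central idea of the summary, augmenting the task loss with an entropy penalty on the feature maps so that they become cheap to compress with a standard entropy coder at inference time, can be illustrated with a short sketch. The snippet below is a minimal PyTorch illustration under assumed names (`soft_entropy`, `cat_loss`, the soft-histogram estimator, and the penalty weight `lam` are all choices made here for exposition), not the authors' exact formulation or reference code.

```python
# Minimal sketch of entropy-regularized ("compression-aware") training.
# All function names and hyperparameters here are illustrative assumptions.
import torch
import torch.nn.functional as F


def soft_entropy(x, n_bins=32, temperature=0.1, max_samples=4096):
    """Differentiable estimate of the entropy (in bits) of a tensor's values,
    computed from a soft histogram with Gaussian bin assignments."""
    x = x.flatten()
    if x.numel() > max_samples:
        # Subsample so the value-to-bin distance matrix stays small.
        idx = torch.randperm(x.numel(), device=x.device)[:max_samples]
        x = x[idx]
    lo, hi = float(x.min()), float(x.max())
    centers = torch.linspace(lo, hi, n_bins, device=x.device)
    # Soft assignment of every value to the histogram bins.
    assign = torch.softmax(-((x[:, None] - centers[None, :]) ** 2) / temperature, dim=1)
    probs = assign.mean(dim=0) + 1e-12
    probs = probs / probs.sum()
    return -(probs * probs.log2()).sum()


def cat_loss(logits, targets, feature_maps, lam=0.05):
    """Task loss plus an average entropy penalty over the collected feature maps."""
    task = F.cross_entropy(logits, targets)
    entropy = sum(soft_entropy(fm) for fm in feature_maps) / len(feature_maps)
    return task + lam * entropy
```

In a training loop one would collect the feature maps that are written to off-chip memory (for instance via forward hooks on the relevant layers) and minimize `cat_loss` instead of the plain cross-entropy; the weight `lam` trades task accuracy against the entropy, and hence the achievable compression ratio, of the activations.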

MSC:

68T05 Learning and adaptive systems in artificial intelligence
