Methods and hardware to accelerate the work of a convolutional neural network
DOI:
https://doi.org/10.15276/aait.06.2023.1
Keywords:
Convolutional neural networks, hardware accelerator, problem-oriented approach, parallel-stream implementation, multi-input adder, scalar product, two-dimensional convolution
Abstract
Three main approaches to building computer systems are identified and analyzed: software, hardware, and problem-oriented. The problem-oriented approach was chosen for implementing convolutional neural networks (CNNs). This approach uses a processor core with hardware accelerators that implement the basic CNN operations. Computer systems for implementing CNNs should be developed on the basis of an integrated approach. This approach covers a modern element base and existing hardware and software for implementing CNNs; methods and algorithms for implementing CNNs; methods, algorithms, and VLSI structures for implementing the basic CNN operations; and methods and tools for the computer-aided design of hardware and software for CNN computer systems. The following principles were chosen for developing such systems: variable composition of equipment; use of a basis of elementary arithmetic operations; organization of the scalar product calculation as the execution of a single operation; pipelining and spatial parallelism; localization and simplification of the links between pipeline stages; and coordination of the time at which input data and weighting coefficients are formed with the duration of the pipeline cycle. It is shown that, to reduce the processing time of large images, a parallel-stream VLSI implementation of the basic operations is the most expedient. The modified Booth algorithm was selected for forming partial products in the parallel-stream computing device, which reduced the number of pipeline stages. The group summation method was improved: with multi-input single-bit adders combined according to the Wallace tree principle, it reduces the summation time.
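The effect of the modified Booth algorithm on the number of partial products, and of tree-style summation on the number of summation levels, can be illustrated with a minimal software sketch. It is not the VLSI structure described in the paper: it assumes radix-4 Booth recoding of a signed two's-complement multiplier and uses ordinary 3-input carry-save compressors in place of the multi-input single-bit adders. The function names (`booth_radix4_digits`, `booth_partial_products`, `wallace_reduce`, `booth_multiply`) are hypothetical.

```python
def booth_radix4_digits(y: int, bits: int) -> list[int]:
    """Recode a signed `bits`-wide multiplier into radix-4 Booth digits in {-2..2}."""
    if bits % 2:
        bits += 1                      # sign-extend to an even width
    y &= (1 << bits) - 1               # two's-complement bit pattern of y
    digits, prev = [], 0               # prev is the bit to the right of the group
    for i in range(0, bits, 2):
        b0 = (y >> i) & 1
        b1 = (y >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits


def booth_partial_products(x: int, y: int, bits: int) -> list[int]:
    """One partial product per Booth digit: about half as many as bit-by-bit."""
    return [(d * x) << (2 * i) for i, d in enumerate(booth_radix4_digits(y, bits))]


def wallace_reduce(terms: list[int]) -> tuple[int, int]:
    """Reduce terms with 3:2 carry-save compressors; return (total, levels used)."""
    terms, levels = list(terms), 0
    while len(terms) > 2:
        nxt = []
        for i in range(0, len(terms) - 2, 3):
            a, b, c = terms[i:i + 3]
            nxt.append(a ^ b ^ c)                            # sum bits, no carry propagation
            nxt.append(((a & b) | (a & c) | (b & c)) << 1)   # carry bits, shifted left
        nxt.extend(terms[len(terms) - len(terms) % 3:])      # pass leftovers through
        terms, levels = nxt, levels + 1
    return sum(terms), levels          # final carry-propagating addition


def booth_multiply(x: int, y: int, bits: int) -> int:
    """Signed product via Booth partial products and carry-save reduction."""
    return wallace_reduce(booth_partial_products(x, y, bits))[0]
```

For an 8-bit multiplier this sketch forms four partial products instead of eight, so fewer compressor levels are needed before the final addition.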
A method for parallel-stream calculation of the scalar product in a sliding window was developed; by coordinating the arrival time of input data columns and weighting coefficients with the duration of the pipeline cycle, it provides high equipment utilization and real-time computation. The main ways of coordinating the arrival time of input data columns and weighting coefficients with the pipeline cycle duration of hardware that implements two-dimensional convolution were determined. A hardware structure for two-dimensional convolution in a sliding window was developed, oriented toward VLSI implementation with high equipment utilization. Programmable logic integrated circuits were selected for implementing the hardware accelerators. Single-bit 7-, 15-, and 31-input adders were developed and modeled on the Altera Cyclone III EP3C16F484 FPGA, and an 8-input 7-bit adder was synthesized from them.
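The idea of computing two-dimensional convolution as a scalar product over a sliding window that advances by one input column per step can be sketched in software as follows. This is only an illustration under stated assumptions, not the hardware structure from the paper: it assumes a square kernel, no padding, a stride of one, and the cross-correlation form of convolution commonly used in CNNs. The function name `sliding_window_conv2d` is hypothetical.

```python
from collections import deque


def sliding_window_conv2d(image: list[list[int]], kernel: list[list[int]]) -> list[list[int]]:
    """Valid-mode 2D convolution (cross-correlation form), one input column per step."""
    k = len(kernel)                            # assumed square k x k kernel
    rows, cols = len(image), len(image[0])
    out = [[0] * (cols - k + 1) for _ in range(rows - k + 1)]
    for r in range(rows - k + 1):              # each band of k adjacent image rows
        window = deque(maxlen=k)               # the k most recently received columns
        for c in range(cols):
            # A new column of k samples arrives; the oldest column drops out.
            window.append([image[r + i][c] for i in range(k)])
            if len(window) == k:
                # Scalar product of the k*k window with the weighting coefficients.
                out[r][c - k + 1] = sum(
                    kernel[i][j] * window[j][i]
                    for i in range(k)
                    for j in range(k)
                )
    return out
```

Once the window is full, every newly received column yields one output value, which is the software analogue of matching the column arrival rate to the pipeline cycle.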