The number of parameters is reduced by 85% and performance exceeds ViT: ViR, a new image classification method

Researchers from East China Normal University and other institutions have proposed ViR, a new image classification method that outperforms ViT in both model size and computational complexity. ViT models stacking many Transformer encoder layers tend to overfit, especially when training data is limited. ViR divides each image into a sequence of fixed-length tokens and builds a pure reservoir with a nearly fully connected topology to replace the Transformer module in ViT. Compared with ViT without pre-training, ViR improves both the initial and the final accuracy, and at the same depth its time cost is far lower than ViT's. The researchers describe the proposed ViR network and further give examples of deep ViR. In line with ViT, the patch size is kept the same across all tested datasets. Table 3 below compares classification accuracy and parameter counts; the numeric suffix indicates the number of ViR layers or ViT encoders.
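To make the pipeline concrete, the sketch below shows the two ingredients the article mentions: splitting an image into fixed-length patch tokens, and feeding them through a fixed, densely (near fully) connected reservoir in place of a Transformer block. This is a minimal illustrative echo-state-style reservoir, not the paper's exact architecture; the sizes, spectral radius, and update rule are assumptions for demonstration only.

```python
import numpy as np

def image_to_tokens(image, patch_size):
    """Split an image of shape (H, W, C) into a sequence of flattened patch tokens."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    tokens = []
    for i in range(0, h, patch_size):
        for j in range(0, w, patch_size):
            patch = image[i:i + patch_size, j:j + patch_size, :]
            tokens.append(patch.reshape(-1))
    return np.stack(tokens)  # (num_tokens, patch_size * patch_size * c)

class Reservoir:
    """Toy echo-state-style reservoir with a dense (near fully connected)
    recurrent topology; the weights are fixed random, not trained.
    All hyperparameters here are illustrative assumptions."""
    def __init__(self, input_dim, reservoir_size=256, spectral_radius=0.9, seed=0):
        rng = np.random.default_rng(seed)
        self.w_in = rng.normal(scale=0.1, size=(reservoir_size, input_dim))
        w = rng.normal(size=(reservoir_size, reservoir_size))
        # Rescale so the largest eigenvalue magnitude equals spectral_radius,
        # a standard way to keep the reservoir dynamics stable.
        w *= spectral_radius / max(abs(np.linalg.eigvals(w)))
        self.w = w

    def run(self, tokens):
        """Drive the reservoir with the token sequence; return the final state,
        which a small trained readout would consume for classification."""
        state = np.zeros(self.w.shape[0])
        for x in tokens:
            state = np.tanh(self.w_in @ x + self.w @ state)
        return state

# Usage: tokenize a toy 8x8 RGB image with 4x4 patches, then run the reservoir.
image = np.random.default_rng(1).random((8, 8, 3))
tokens = image_to_tokens(image, patch_size=4)            # 4 tokens of dim 48
state = Reservoir(input_dim=tokens.shape[1]).run(tokens)  # final reservoir state
```

Because the recurrent weights stay fixed, only a lightweight readout needs training, which is consistent with the large parameter reduction the article reports.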