Neural Nets Are Overparameterized

Bibliography of ML papers on distillation/compression/sparsification of neural nets, showing that NNs are highly overparameterized & inefficient, and that equivalent, much smaller & faster NNs exist, implying any NN has a 'hardware overhang'.
bibliography⁠, NN⁠, technology
2017-12-15–2021-06-11 · finished · certainty: log · importance: 6

Neural nets are extremely 'overparameterized': they have orders of magnitude more parameters than necessary to solve the problems they are trained on. This can be shown by the existence of smaller/faster but still performant networks, and demonstrated directly by creating smaller neural nets with similar or identical performance on those problems: by deleting parameters (sparsification), reducing the precision of the numeric encoding (compression/quantization), or training a much smaller network from scratch using the original large network as a guide (distillation). Mysteriously, these smaller networks typically cannot be trained from scratch; performance gains can be obtained without the original data; models can be trained to imitate themselves in self-distillation; despite this indicating overfitting ought to be a major concern, they generalize well; and many of these smaller networks are, in some sense, already present in the original neural network. This is frequently taken to indicate a blessing of scale: large NNs have smoother loss landscapes, which simple optimizers can successfully traverse to good optima no matter how hard the problem, as compared to smaller networks, which may wind up 'trapped' at a bad place with no free parameters to let them slip around obstacles and find some way to improve (much less the loss landscapes of equivalently powerful but extremely brittle encodings such as Brainfuck or assembler programs). Beyond their great theoretical interest (how can we train these small models directly? what does this tell us about how NNs work?), such smaller NNs are critical to practical real-world deployment to servers & smartphones at scale and to the design of accelerator hardware supporting reduced-precision operations, and they are also an interesting case of capability growth for AI risk: as soon as any NN exists which can achieve performance goal X, it is likely that a much more efficient NN (potentially orders of magnitude smaller or faster) can be created to achieve X almost immediately thereafter. (These are merely one way that your software can be much faster⁠.)
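Two of the techniques above can be sketched in a few lines of numpy. This is a toy illustration under assumed simplifications (a random matrix standing in for a trained layer; distillation shown only as the computation of softened teacher targets, per Hinton et al.'s temperature recipe), not any specific paper's method:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy weight matrix standing in for one trained layer of a large NN.
W = rng.normal(size=(256, 256))

def magnitude_prune(weights, sparsity):
    """Sparsification: zero out the smallest-magnitude fraction of weights."""
    k = int(weights.size * sparsity)
    threshold = np.partition(np.abs(weights).ravel(), k)[k]
    return np.where(np.abs(weights) < threshold, 0.0, weights)

# Delete ~90% of parameters; in practice accuracy often barely drops
# (usually after a brief fine-tuning step, omitted here).
W_pruned = magnitude_prune(W, 0.90)
print(1 - np.count_nonzero(W_pruned) / W_pruned.size)  # fraction zeroed, ≈0.90

def distillation_targets(teacher_logits, temperature=4.0):
    """Distillation: softened teacher probabilities used as training
    targets for a much smaller student network."""
    z = teacher_logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)

# Higher temperature flattens the distribution, exposing the teacher's
# 'dark knowledge' about relative class similarities to the student.
soft = distillation_targets(np.array([[2.0, 1.0, 0.1]]))
```

The student would then be trained to match `soft` (typically mixed with the true hard labels); the pruned `W_pruned` can be stored sparsely for the claimed size/FLOP savings.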

Some examples of NNs being compressed in size or FLOPs by anywhere from 50% to ~17,000% (ie. roughly 2×–170×)—a vastly incomplete bibliography, merely some papers I have noted during my general reading: