Fully-Connected Neural Nets

Bibliography of ML papers related to multi-layer perceptrons/fully-connected neural nets, often showing surprising efficacy despite their reputation for being *too* general to be usable practically (representing a possible future Bitter Lesson).
bibliography, NN
2021-04-24–2021-06-10; finished; certainty: log; importance: 5

  1. Why now, if MLPs were always roughly data- & compute-competitive with Transformers, and thus with CNNs? My current theory is that the critical ingredient is normalization and/or gating: MLPs, while always acknowledged as extremely powerful in principle, underperformed in practice or were highly unstable to train. Normalization & gating are relatively recent, typically post-2015, and they stabilize MLPs to the point where they Just Work. If you look at the current crop of MLP papers, what they all seem to have in common is normalization/gating (sometimes hidden or dismissed as an ‘Affine’ layer), and if you remove those ingredients, your loss may go from, eg, a perplexity of ~4 to >100; the ones which don’t use these tricks, like many NeRF papers, are also extremely shallow. Combine that with the great success of ResNet CNNs, and it’s unsurprising if MLPs were not trial-and-errored enough post-2015 to discover that they worked, until the cost of self-attention in Transformers drove interest in removing as much self-attention as possible, eventually leading to the discovery that you can remove all of it.↩︎

  2. Note the use of careful initialization to make the MLPs trainable without residual layers, normalization, or gating.↩︎
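
To make the "normalization and/or gating" ingredients from note 1 concrete, here is a minimal pure-Python sketch of a residual MLP block that normalizes its input and gates its hidden activations. It is an illustration under stated assumptions, not the architecture of any particular paper: the sigmoid (GLU-style) gate, the parameter-free layer norm, the 1/sqrt(fan-in) Gaussian initialization, and all names (`layer_norm`, `gated_mlp_block`, `make_block`, the sizes) are hypothetical choices made for this example.

```python
import math
import random

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance.
    Assumption: no learned scale/shift, to keep the sketch short."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(x, weights, bias):
    """Dense layer; `weights` is a list of rows, one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def init_linear(n_in, n_out, rng):
    """Gaussian weights scaled by 1/sqrt(fan-in) (an assumed, generic init)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def make_block(dim, hidden, rng):
    """Parameters for one block: value path, gate path, output projection."""
    return {
        "value": init_linear(dim, hidden, rng),
        "gate": init_linear(dim, hidden, rng),
        "out": init_linear(hidden, dim, rng),
    }

def gated_mlp_block(x, params):
    """x + W_out(value(h) * sigmoid(gate(h))), where h = layer_norm(x)."""
    h = layer_norm(x)                                       # normalization
    value = linear(h, *params["value"])
    gate = linear(h, *params["gate"])
    gated = [v * sigmoid(g) for v, g in zip(value, gate)]   # gating
    out = linear(gated, *params["out"])
    return [a + b for a, b in zip(x, out)]                  # residual add

# Stack many blocks and run one forward pass on a random input.
rng = random.Random(0)
dim, hidden, depth = 8, 16, 32
blocks = [make_block(dim, hidden, rng) for _ in range(depth)]
x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
for block in blocks:
    x = gated_mlp_block(x, block)
```

Because each block re-normalizes its input before the dense layers, the values entering the value and gate paths stay on a fixed scale regardless of depth. This only demonstrates the forward pass; it makes no claim about training behavior or the perplexity figures quoted in note 1.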