‘MLP NN’ directory

Gwern

Gwern

“Dat Tail, Dat Flank—Never Forget”, Gwern 2025

“Absolute Unit NNs: Regression-Based MLPs for Everything”, Gwern 2023

“Research Ideas”, Gwern 2017

“Modular Brain AUNNs for Uploads”, Gwern 2023

“Language-Conditioned Absolute Unit NNs”, Gwern 2022

Bibliography

https://arxiv.org/abs/2503.24187: “NeuRaLaTeX: A Machine Learning Library Written in Pure LaTeX”, James A. D. Gardner, Will Rowan, William A. P. Smith
https://arxiv.org/abs/2501.03992: “NeuralSVG: An Implicit Representation for Text-To-Vector Generation”, Sagi Polaczek, Yuval Alaluf, Elad Richardson, Yael Vinker, Daniel Cohen-Or
https://www.lesswrong.com/posts/LncYobrn3vRr7qkZW/the-slingshot-helps-with-learning: “The Slingshot Helps With Learning”, Wilson Wu
https://arxiv.org/abs/2406.15786: “What Matters in Transformers? Not All Attention Is Needed”, Shwai He, Guoheng Sun, Zheyu Shen, Ang Li
https://arxiv.org/abs/2406.13131: “When Parts Are Greater Than Sums: Individual LLM Components Can Outperform Full Models”, Ting-Yun Chang, Jesse Thomason, Robin Jia
https://arxiv.org/abs/2406.11233: “Probing the Decision Boundaries of In-Context Learning in Large Language Models”, Siyan Zhao, Tung Nguyen, Aditya Grover
https://arxiv.org/abs/2405.20233: “Grokfast: Accelerated Grokking by Amplifying Slow Gradients”, Jaerin Lee, Bong Gyun Kang, Kihoon Kim, Kyoung Mu Lee
https://arxiv.org/abs/2310.13061: “To Grok or Not to Grok: Disentangling Generalization and Memorization on Corrupted Algorithmic Datasets”, Darshil Doshi, Aritra Das, Tianyu He, Andrey Gromov
https://arxiv.org/abs/2310.08708: “Polynomial Time Cryptanalytic Extraction of Neural Network Models”, Adi Shamir, Isaac Canales-Martinez, Anna Hambitzer, Jorge Chavez-Saab, Francisco Rodriguez-Henriquez, Nitin Satpute
https://arxiv.org/abs/2306.13575: “Scaling MLPs: A Tale of Inductive Bias”, Gregor Bachmann, Sotiris Anagnostidis, Thomas Hofmann
https://arxiv.org/abs/2303.13506: “The Quantization Model of Neural Scaling”, Eric J. Michaud, Ziming Liu, Uzay Girit, Max Tegmark
https://arxiv.org/abs/2303.06053#google: “TSMixer: An All-MLP Architecture for Time Series Forecasting”, Si-An Chen, Chun-Liang Li, Nate Yoder, Sercan O. Arik, Tomas Pfister
2023-bures.pdf: “Organic Reaction Mechanism Classification Using Machine Learning”, Jordi Burés, Igor Larrosa
https://www.nature.com/articles/s41467-022-35422-y: “Merging Enzymatic and Synthetic Chemistry With Computational Synthesis Planning”, Itai Levin, Mengjie Liu, Christopher A. Voigt, Connor W. Coley
https://arxiv.org/abs/2211.03495: “How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers”, Michael Hassid, Hao Peng, Daniel Rotem, Jungo Kasai, Ivan Montero, Noah Smith, Roy Schwartz
https://arxiv.org/abs/2210.06313#google: “The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers”, Zonglin Li, Chong You, Srinadh Bhojanapalli, Daliang Li, Ankit Singh Rawat, Sashank J. Reddi, Ke Ye, Felix Chern, Felix Yu, Ruiqi Guo, Sanjiv Kumar
https://arxiv.org/abs/2210.03310#google: “Scaling Forward Gradient With Local Losses”, Mengye Ren, Simon Kornblith, Renjie Liao, Geoffrey Hinton
https://arxiv.org/abs/2210.01117: “Omnigrok: Grokking Beyond Algorithmic Data”, Ziming Liu, Eric J. Michaud, Max Tegmark
https://arxiv.org/abs/2209.12892: “g.pt: Learning to Learn With Generative Models of Neural Network Checkpoints”, William Peebles, Ilija Radosavovic, Tim Brooks, Alexei A. Efros, Jitendra Malik
https://arxiv.org/abs/2207.10551#google: “Scaling Laws vs Model Architectures: How Does Inductive Bias Influence Scaling?”, Yi Tay, Mostafa Dehghani, Samira Abnar, Hyung Won Chung, William Fedus, Jinfeng Rao, Sharan Narang, Vinh Q. Tran, Dani Yogatama, Donald Metzler
https://arxiv.org/abs/2206.07137: “RHO-LOSS: Prioritized Training on Points That Are Learnable, Worth Learning, and Not Yet Learnt”, Sören Mindermann, Jan Brauner, Muhammed Razzak, Mrinank Sharma, Andreas Kirsch, Winnie Xu, Benedikt Höltgen, Aidan N. Gomez, Adrien Morisot, Sebastian Farquhar, Yarin Gal
https://arxiv.org/abs/2206.05852: “ChordMixer: A Scalable Neural Attention Model for Sequences With Different Lengths”, Ruslan Khalitov, Tong Yu, Lei Cheng, Zhirong Yang
https://arxiv.org/abs/2205.12399#google: “Sparse Mixers: Combining MoE and Mixing to Build a More Efficient BERT”, James Lee-Thorp, Joshua Ainslie
https://arxiv.org/abs/2205.10343: “Towards Understanding Grokking: An Effective Theory of Representation Learning”, Ziming Liu, Ouail Kitouni, Niklas Nolte, Eric J. Michaud, Max Tegmark, Mike Williams
https://arxiv.org/abs/2204.10670: “Paramixer: Parameterizing Mixing Links in Sparse Factors Works Better Than Dot-Product Self-Attention”, Tong Yu, Ruslan Khalitov, Lei Cheng, Zhirong Yang
https://arxiv.org/abs/2203.06850: “Efficient Language Modeling With Sparse All-MLP”, Ping Yu, Mikel Artetxe, Myle Ott, Sam Shleifer, Hongyu Gong, Ves Stoyanov, Xian Li
https://arxiv.org/abs/2203.03691: “HyperMixer: An MLP-Based Low Cost Alternative to Transformers”, Florian Mai, Arnaud Pannatier, Fabio Fehr, Haolin Chen, Francois Marelli, Francois Fleuret, James Henderson
https://arxiv.org/abs/2202.06510#microsoft: “Mixing and Shifting: Exploiting Global and Local Dependencies in Vision MLPs”, Huangjie Zheng, Pengcheng He, Weizhu Chen, Mingyuan Zhou
https://arxiv.org/abs/2201.10801: “When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism (ShiftViT)”, Guangting Wang, Yucheng Zhao, Chuanxin Tang, Chong Luo, Wenjun Zeng
https://arxiv.org/abs/2201.09792: “ConvMixer: Patches Are All You Need?”, Asher Trockman, J. Zico Kolter
https://arxiv.org/abs/2111.11418: “MetaFormer Is Actually What You Need for Vision”, Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, Shuicheng Yan
https://arxiv.org/abs/2110.11526#deepmind: “Wide Neural Networks Forget Less Catastrophically”, Seyed Iman Mirzadeh, Arslan Chaudhry, Dong Yin, Huiyi Hu, Razvan Pascanu, Dilan Gorur, Mehrdad Farajtabar
https://arxiv.org/abs/2110.02095#google: “Exploring the Limits of Large Scale Pre-Training”, Samira Abnar, Mostafa Dehghani, Behnam Neyshabur, Hanie Sedghi
https://arxiv.org/abs/2109.05422: “Sparse MLP for Image Recognition: Is Self-Attention Really Necessary?”, Chuanxin Tang, Yucheng Zhao, Guangting Wang, Chong Luo, Wenxuan Xie, Wenjun Zeng
https://arxiv.org/abs/2109.04454: “ConvMLP: Hierarchical Convolutional MLPs for Vision”, Jiachen Li, Ali Hassani, Steven Walton, Humphrey Shi
https://arxiv.org/abs/2108.13002#microsoft: “A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP”, Yucheng Zhao, Guangting Wang, Chuanxin Tang, Chong Luo, Wenjun Zeng, Zheng-Jun Zha
https://arxiv.org/abs/2108.13341#huawei: “Hire-MLP: Vision MLP via Hierarchical Rearrangement”, Jianyuan Guo, Yehui Tang, Kai Han, Xinghao Chen, Han Wu, Chao Xu, Chang Xu, Yunhe Wang
https://arxiv.org/abs/2108.04384: “RaftMLP: How Much Can Be Done Without Attention and With Less Spatial Locality?”, Yuki Tatsunami, Masato Taki
https://arxiv.org/abs/2108.01072#baidu: “S²-MLPv2: Improved Spatial-Shift MLP Architecture for Vision”, Tan Yu, Xu Li, Yunfeng Cai, Mingming Sun, Ping Li
https://arxiv.org/abs/2107.10224: “CycleMLP: A MLP-Like Architecture for Dense Prediction”, Shoufa Chen, Enze Xie, Chongjian Ge, Runjian Chen, Ding Liang, Ping Luo
https://arxiv.org/abs/2107.08391: “AS-MLP: An Axial Shifted MLP Architecture for Vision”, Dongze Lian, Zehao Yu, Xing Sun, Shenghua Gao
https://arxiv.org/abs/2106.12368: “Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition”, Qibin Hou, Zihang Jiang, Li Yuan, Ming-Ming Cheng, Shuicheng Yan, Jiashi Feng
https://arxiv.org/abs/2106.12372#nvidia: “Real-Time Neural Radiance Caching for Path Tracing”, Thomas Müller, Fabrice Rousselle, Jan Novák, Alexander Keller
https://arxiv.org/abs/2106.07477#baidu: “S²-MLP: Spatial-Shift MLP Architecture for Vision”, Tan Yu, Xu Li, Yunfeng Cai, Mingming Sun, Ping Li
https://arxiv.org/abs/2106.01548: “When Vision Transformers Outperform ResNets without Pre-Training or Strong Data Augmentations”, Xiangning Chen, Cho-Jui Hsieh, Boqing Gong
https://arxiv.org/abs/2106.01401: “Container: Context Aggregation Network”, Peng Gao, Jiasen Lu, Hongsheng Li, Roozbeh Mottaghi, Aniruddha Kembhavi
https://arxiv.org/abs/2105.08050#google: “Pay Attention to MLPs”, Hanxiao Liu, Zihang Dai, David R. So, Quoc V. Le
https://arxiv.org/abs/2105.03824#google: “FNet: Mixing Tokens With Fourier Transforms”, James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon
https://arxiv.org/abs/2105.02723: “Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet”, Luke Melas-Kyriazi
https://arxiv.org/abs/2105.01883: “RepMLP: Re-Parameterizing Convolutions into Fully-Connected Layers for Image Recognition”, Xiaohan Ding, Chunlong Xia, Xiangyu Zhang, Xiaojie Chu, Jungong Han, Guiguang Ding
https://arxiv.org/abs/2105.01601#google: “MLP-Mixer: An All-MLP Architecture for Vision”, Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, Alexey Dosovitskiy
2021-power.pdf#openai: “Grokking: Generalization Beyond Overfitting On Small Algorithmic Datasets”, Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, Vedant Misra
abstract: “Fully-Connected Neural Nets”, Gwern
https://arxiv.org/abs/2103.14030: “Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows”, Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo
https://arxiv.org/abs/2011.13775: “Image Generators With Conditionally-Independent Pixel Synthesis”, Ivan Anokhin, Kirill Demochkin, Taras Khakhulin, Gleb Sterkin, Victor Lempitsky, Denis Korzhenkov
https://arxiv.org/abs/2005.00743#google: “Synthesizer: Rethinking Self-Attention in Transformer Models”, Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, Che Zheng
https://arxiv.org/abs/2003.01629: “Can Increasing Input Dimensionality Improve Deep Reinforcement Learning?”, Kei Ota, Tomoaki Oiki, Devesh K. Jha, Toshisada Mariyama, Daniel Nikovski
https://arxiv.org/abs/1911.13299: “What’s Hidden in a Randomly Weighted Neural Network?”, Vivek Ramanujan, Mitchell Wortsman, Aniruddha Kembhavi, Ali Farhadi, Mohammad Rastegari
https://arxiv.org/abs/1804.00222#google: “Meta-Learning Update Rules for Unsupervised Representation Learning”, Luke Metz, Niru Maheswaranathan, Brian Cheung, Jascha Sohl-Dickstein
2017-sabatelli.pdf#page=3: “Learning to Play Chess With Minimal Lookahead and Deep Value Neural Networks”, Matthia Sabatelli
https://arxiv.org/abs/1402.1869: “On the Number of Linear Regions of Deep Neural Networks”, Guido Montúfar, Razvan Pascanu, Kyunghyun Cho, Yoshua Bengio
2011-collobert.pdf: “Natural Language Processing (Almost) from Scratch”, Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, Pavel Kuksa