notes/FC (Link Bibliography)

“notes/​FC” links:

  1. https://jmlr.org/papers/volume3/bengio03a/bengio03a.pdf

  2. ⁠, Simeng Sun, Mohit Iyyer (2021-04-08):

    Recent progress in language modeling has been driven not only by advances in neural architectures, but also through hardware and optimization improvements. In this paper, we revisit the neural probabilistic language model (NPLM) of Bengio et al. (2003), which simply concatenates word embeddings within a fixed window and passes the result through a feed-forward network to predict the next word. When scaled up to modern hardware, this model (despite its many limitations) performs much better than expected on word-level language model benchmarks. Our analysis reveals that the NPLM achieves lower perplexity than a baseline Transformer with short input contexts but struggles to handle long-term dependencies. Inspired by this result, we modify the Transformer by replacing its first self-attention layer with the NPLM’s local concatenation layer, which results in small but consistent perplexity decreases across three word-level language modeling datasets.
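
    [To make the mechanism concrete, here is a minimal sketch of such a window-concatenation NPLM in PyTorch. This illustrates the idea described above, not the authors’ code; the vocabulary, window, and layer sizes are arbitrary.]

    ```python
    import torch
    import torch.nn as nn

    class WindowNPLM(nn.Module):
        """Concatenate the embeddings of the last `window` tokens, then predict the next token."""
        def __init__(self, vocab_size=10000, d_embed=256, window=8, d_hidden=1024):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_embed)
            self.ff = nn.Sequential(
                nn.Linear(window * d_embed, d_hidden),
                nn.ReLU(),
                nn.Linear(d_hidden, vocab_size),
            )

        def forward(self, context_ids):          # (batch, window) token ids
            e = self.embed(context_ids)          # (batch, window, d_embed)
            e = e.flatten(start_dim=1)           # concatenate embeddings within the window
            return self.ff(e)                    # (batch, vocab_size) next-token logits

    logits = WindowNPLM()(torch.randint(0, 10000, (4, 8)))
    ```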

  3. ⁠, Zhaozhuo Xu, Minghao Yan, Junyan Zhang, Anshumali Shrivastava (2021-06-15):

    Transformer models have demonstrated superior performance in natural language processing. The dot product self-attention in Transformer allows us to model interactions between words. However, this modeling comes with significant computational overhead. In this work, we revisit the memory-compute trade-off associated with Transformer, particularly multi-head attention, and show a memory-heavy but statistically-significantly more compute-efficient alternative to Transformer. Our proposal, denoted as PairConnect, a multilayer perceptron (MLP), models the pairwise interaction between words by explicit pairwise word embeddings. As a result, PairConnect substitutes self dot product with a simple embedding lookup. We show mathematically that despite being an MLP, our compute-efficient PairConnect is strictly more expressive than Transformer. Our experiment on language modeling tasks suggests that PairConnect could achieve comparable results with Transformer while reducing the computational cost associated with inference significantly.
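
    [A rough sketch of the PairConnect idea in PyTorch: replace the attention dot product with a direct embedding lookup for each (word i, word j) pair, aggregated per word. A literal pair table would be O(V²), so this sketch hashes pair indices into a fixed-size table; the hashing scheme and sizes are illustrative assumptions, not the paper’s exact construction.]

    ```python
    import torch
    import torch.nn as nn

    class PairEmbedding(nn.Module):
        def __init__(self, table_size=2**20, d_model=256):
            super().__init__()
            self.table = nn.Embedding(table_size, d_model)
            self.table_size = table_size

        def forward(self, ids):                                     # (batch, seq) token ids
            pair = ids[:, :, None] * 1_000_003 + ids[:, None, :]    # (batch, seq, seq) pair codes
            slot = pair % self.table_size                           # hash each pair into the table
            pair_emb = self.table(slot)                             # (batch, seq, seq, d_model)
            return pair_emb.sum(dim=2)                              # aggregate interactions per word

    out = PairEmbedding()(torch.randint(0, 30000, (2, 16)))
    ```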

  4. 2003-simard.pdf#microsoft: ⁠, Patrice Y. Simard, Dave Steinkraus, John C. Platt (2003; ai):

    Neural networks are a powerful technology for classification of visual inputs arising from documents. However, there is a confusing plethora of different neural network methods that are used in the literature and in industry.

    This paper describes a set of concrete best practices that document analysis researchers can use to get good results with neural networks.

    The most important practice is getting a training set as large as possible: we expand the training set by adding a new form of distorted data.

    The next most important practice is that convolutional neural networks are better suited for visual document tasks than fully connected networks. We propose that a simple “do-it-yourself” implementation of convolution with a flexible architecture is suitable for many visual document problems. This simple convolutional neural network does not require complex methods, such as momentum, ⁠, structure-dependent learning rates, averaging layers, tangent prop, or even finely-tuning the architecture.

    The end result is a very simple yet general architecture which can yield state-of-the-art performance for document analysis.

    We illustrate our claims on the MNIST set of English digit images.

  5. 2006-chatelain.pdf: ⁠, Clément Chatelain (2006-12-05; ai):

    Within the framework of the automatic processing of incoming mail documents, we present in this thesis the conception and development of a numerical field extraction system in weakly constrained handwritten documents.

    Although the recognition of isolated handwritten entities can be considered as a partially solved problem, the extraction of information in images of complex and free-layout documents is still a challenge. This problem requires the implementation of both handwriting recognition and information extraction methods inspired by approaches developed within the field of information extraction in electronic documents.

    Our contribution consists in the conception and the implementation of 2 different strategies: the first extends classical handwriting recognition methods, while the second is inspired by approaches used within the field of information extraction in electronic documents.

    The results obtained on a real handwritten mail database show that our second approach is substantially better.

    Finally, a complete, generic and efficient system is produced, answering one of the emergent perspectives in the field of the automatic reading of handwritten documents: the extraction of complex information in images of documents. [Text of paper is in French.]

  6. 2012-ciresan.pdf: ⁠, Dan Claudiu Cireşan, Ueli Meier, Luca Maria Gambardella, Jürgen Schmidhuber (2012; ai):

    The competitive MNIST handwritten digit recognition benchmark has a long history of broken records since 1998. The most recent advancement by others dates back 8 years (error rate 0.4%).

    Good old on-line backpropagation for plain multi-layer perceptrons yields a very low 0.35% error rate on the MNIST handwritten digits benchmark with a single MLP, and 0.31% with a committee of 7 MLPs.

    All we need to achieve this until-2011-best-result are many hidden layers, many neurons per layer, numerous deformed training images to avoid overfitting, and graphics cards to greatly speed up learning.

    [Keywords: neural network, multilayer perceptron, ⁠, training set deformations, MNIST, committee, backpropagation]

    Note: This work combines 3 previously published papers [1,2,3].

    …In recent decades the amount of raw computing power per Euro has grown by a factor of 100–1000 per decade. Our results show that this ongoing hardware progress may be more important than advances in algorithms and software (although the future will belong to methods combining the best of both worlds). Current graphics cards (GPUs) are already more than 50× faster than standard microprocessors when it comes to training big and deep neural networks by the ancient algorithm, online backpropagation (weight update rate up to 7.5×10⁹/s, and more than 10¹⁵ per trained network). On the competitive MNIST handwriting benchmark, single precision floating-point GPU-based neural nets surpass all previously reported results, including those obtained by much more complex methods involving specialized architectures, unsupervised pre-training, combinations of machine learning classifiers etc. Training sets of sufficient size to avoid overfitting are obtained by appropriately deforming images.

    Of course, the approach is not limited to handwriting, and obviously holds great promise for many visual and other pattern recognition problems.

  7. ⁠, Lei Jimmy Ba, Rich Caruana (2013-12-21):

    Currently, deep neural networks are the state of the art on problems such as speech recognition and computer vision. In this extended abstract, we show that shallow feed-forward networks can learn the complex functions previously learned by deep nets and achieve accuracies previously only achievable with deep models. Moreover, in some cases the shallow neural nets can learn these deep functions using a total number of parameters similar to the original deep model. We evaluate our method on the TIMIT phoneme recognition task and are able to train shallow fully-connected nets that perform similarly to complex, well-engineered, deep convolutional architectures. Our success in training shallow neural nets to mimic deeper models suggests that there probably exist better algorithms for training shallow feed-forward nets than those currently available.
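
    [A minimal sketch of the mimic-training setup described above, assuming PyTorch: a shallow, wide student MLP is trained to regress the logits of an already-trained deep teacher with an L2 loss rather than on the original labels. Sizes and hyperparameters are illustrative.]

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def mimic_step(student, teacher, x, optimizer):
        with torch.no_grad():
            target_logits = teacher(x)                   # soft targets: the teacher's unnormalized outputs
        loss = F.mse_loss(student(x), target_logits)     # regress logits, not 0/1 labels
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    teacher = nn.Sequential(nn.Linear(784, 1200), nn.ReLU(),
                            nn.Linear(1200, 1200), nn.ReLU(), nn.Linear(1200, 10))
    student = nn.Sequential(nn.Linear(784, 8000), nn.ReLU(), nn.Linear(8000, 10))  # one wide hidden layer
    opt = torch.optim.SGD(student.parameters(), lr=0.01, momentum=0.9)
    mimic_step(student, teacher, torch.randn(32, 784), opt)
    ```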

  8. ⁠, Min Lin, Qiang Chen, Shuicheng Yan (2013-12-16):

    We propose a novel deep network structure called “Network In Network” (NIN) to enhance model discriminability for local patches within the receptive field. The conventional convolutional layer uses linear filters followed by a nonlinear activation function to scan the input. Instead, we build micro neural networks with more complex structures to abstract the data within the receptive field. We instantiate the micro neural network with a multilayer perceptron, which is a potent function approximator. The feature maps are obtained by sliding the micro networks over the input in a similar manner as CNN; they are then fed into the next layer. Deep NIN can be implemented by stacking multiple of the above described structure. With enhanced local modeling via the micro network, we are able to utilize global average pooling over feature maps in the classification layer, which is easier to interpret and less prone to overfitting than traditional fully connected layers. We demonstrated the state-of-the-art classification performances with NIN on CIFAR-10 and CIFAR-100, and reasonable performances on SVHN and MNIST datasets.
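
    [A short sketch of the two ideas above in PyTorch: “micro” MLPs applied at every spatial location (equivalent to 1×1 convolutions following an ordinary convolution) and a global-average-pooling classifier in place of fully-connected layers. Layer sizes are illustrative.]

    ```python
    import torch
    import torch.nn as nn

    def mlpconv(c_in, c_out):
        # ordinary convolution followed by a per-pixel MLP (1x1 convolutions)
        return nn.Sequential(
            nn.Conv2d(c_in, c_out, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(c_out, c_out, kernel_size=1), nn.ReLU(),
            nn.Conv2d(c_out, c_out, kernel_size=1), nn.ReLU(),
        )

    net = nn.Sequential(
        mlpconv(3, 96), nn.MaxPool2d(2),
        mlpconv(96, 192), nn.MaxPool2d(2),
        mlpconv(192, 10),                       # one feature map per class
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),  # global average pooling -> class logits
    )
    logits = net(torch.randn(2, 3, 32, 32))     # (2, 10)
    ```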

  9. ⁠, Zhouhan Lin, Roland Memisevic, Kishore Konda (2015-11-09):

    We propose ways to improve the performance of fully connected networks. We found that two approaches in particular have a strong effect on performance: linear bottleneck layers and unsupervised pre-training using autoencoders without hidden unit biases. We show how both approaches can be related to improving gradient flow and reducing sparsity in the network. We show that a fully connected network can yield approximately 70% classification accuracy on the permutation-invariant CIFAR-10 task, which is much higher than the current state-of-the-art. By adding deformations to the training data, the fully connected network achieves 78% accuracy, which is just 10% short of a decent convolutional network.

  10. 2015-bluche.pdf: ⁠, Théodore Bluche (2015-05-13; ai):

    The automatic transcription of text in handwritten documents has many applications, from automatic document processing, to indexing and document understanding.

    One of the most popular approaches nowadays consists in scanning the text line image with a sliding window, from which features are extracted, and modeled by Hidden Markov Models (HMMs). Associated with neural networks, such as Multi-Layer Perceptrons (MLPs) or Long Short-Term Memory Recurrent Neural Networks (LSTM-RNNs), and with a language model, these models yield good transcriptions. On the other hand, in many machine learning applications, including speech recognition and computer vision, deep neural networks consisting of several hidden layers recently produced a large reduction of error rates.

    In this thesis, we have conducted a thorough study of different aspects of optical models based on deep neural networks in the hybrid neural network / HMM scheme, in order to better understand and evaluate their relative importance.

    1. First, we show that deep neural networks produce consistent and large improvements over networks with one or 2 hidden layers, independently of the kind of neural network, MLP or RNN, and of input, handcrafted features or pixels.
    2. Then, we show that deep neural networks with pixel inputs compete with those using handcrafted features, and that depth plays an important role in the reduction of the performance gap between the 2 kinds of inputs, supporting the idea that deep neural networks effectively build hierarchical and relevant representations of their inputs, and that features are automatically learnt on the way.
    3. Despite the dominance of LSTM-RNNs in the recent literature of handwriting recognition, we show that deep MLPs achieve comparable results. Moreover, we evaluated different training criteria. With sequence-discriminative training, we report similar improvements for MLP/​​​​​HMMs as those observed in speech recognition.
    4. We also show how the Connectionist Temporal Classification framework is especially suited to RNNs.
    5. Finally, the novel dropout technique to regularize neural networks was recently applied to LSTM-RNNs. We tested its effect at different positions in LSTM-RNNs, thus extending previous works, and we show that its relative position to the recurrent connections is important.

    We conducted the experiments on 3 public databases, representing 2 languages (English and French) and 2 epochs, using different kinds of neural network inputs: handcrafted features and pixels. We validated our approach by taking part in the HTRtS contest in 2014.

    The results of the final systems presented in this thesis, namely MLPs and RNNs, with handcrafted feature or pixel inputs, are comparable to the state-of-the-art on Rimes and IAM. Moreover, the combination of these systems outperformed all published results on the considered databases.

    [Keywords: pattern recognition, Hidden Markov Models, neural networks, hand-writing recognition]

  11. ⁠, Alexander Novikov, Dmitry Podoprikhin, Anton Osokin, Dmitry Vetrov (2015-09-22):

    Deep neural networks currently demonstrate state-of-the-art performance in several domains. At the same time, models of this class are very demanding in terms of computational resources. In particular, a large amount of memory is required by commonly used fully-connected layers, making it hard to use the models on low-end devices and stopping the further increase of the model size. In this paper we convert the dense weight matrices of the fully-connected layers to the Tensor Train format such that the number of parameters is reduced by a huge factor and at the same time the expressive power of the layer is preserved. In particular, for the Very Deep VGG networks we report the compression factor of the dense weight matrix of a fully-connected layer up to 200000 times leading to the compression factor of the whole network up to 7 times.
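
    [A back-of-the-envelope illustration of the Tensor Train compression described above: the row and column dimensions of a dense weight matrix are factorized, and one small 4-D core is stored per factor instead of the full matrix. The factorization and rank below are illustrative, not the paper’s settings.]

    ```python
    def tt_params(in_factors, out_factors, rank):
        """Parameter count of a TT-matrix with the given mode sizes and uniform rank."""
        ranks = [1] + [rank] * (len(in_factors) - 1) + [1]
        return sum(ranks[k] * in_factors[k] * out_factors[k] * ranks[k + 1]
                   for k in range(len(in_factors)))

    dense = 4096 * 4096                                   # ordinary FC layer: ~16.8M weights
    tt = tt_params([8, 8, 8, 8], [8, 8, 8, 8], rank=8)    # TT cores for the same 4096x4096 matrix: 9216 weights
    print(dense, tt, dense / tt)                          # compression factor of roughly 1800x here
    ```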

  12. ⁠, Gregor Urban, Krzysztof J. Geras, Samira Ebrahimi Kahou, Ozlem Aslan, Shengjie Wang, Rich Caruana, Abdelrahman Mohamed, Matthai Philipose, Matt Richardson (2016-03-17):

    Figure 1: Accuracy of student models with different architectures trained to mimic the CIFAR10 ensemble. The average performance of the 5 best models of each hyperparameter-optimization experiment is shown, together with dashed lines indicating the accuracy of the best and the fifth best model from each setting. The short horizontal lines at 10M parameters are the accuracy of models trained without compression on the original 0/​​​​1 hard targets.

    Yes, they do. This paper provides the first empirical demonstration that deep convolutional models really need to be both deep and convolutional, even when trained with methods such as distillation that allow small or shallow models of high accuracy to be trained.

    Although previous research showed that shallow feed-forward nets sometimes can learn the complex functions previously learned by deep nets while using the same number of parameters as the deep models they mimic, in this paper we demonstrate that the same methods cannot be used to train accurate models on CIFAR-10 unless the student models contain multiple layers of convolution. Although the student models do not have to be as deep as the teacher model they mimic, the students need multiple convolutional layers to learn functions of comparable accuracy as the deep convolutional teacher.

    Figure 1 summarizes the results in Table 2 for student models of different depth, number of convolutional layers, and number of parameters when trained to mimic the ensemble teacher model. Models trained on the ensemble logits are able to achieve accuracies previously unseen on CIFAR-10 for models with so few layers. Also, it is clear that there is a huge gap between the convolutional student models at the top of the figure, and the non-convolutional student models at the bottom of the figure: the most accurate student MLP has accuracy less than 75%, while the least accurate convolutional student model with the same number of parameters but only one convolutional layer has accuracy above 87%. And the accuracy of the convolutional student models increases further as more layers of convolution are added. Interestingly, the most accurate student MLPs with no convolutional layers have only 2 or 3 hidden layers; the student MLPs with 4 or 5 hidden layers are not as accurate.

    Comparing the student MLP with only one hidden layer (bottom of the graph) to the student CNN with 1 convolutional layer clearly suggests that convolution is critical for this problem even when models are trained via distillation, and that it is very unlikely that a shallow non-convolutional model with 100 million parameters or less could ever achieve accuracy comparable to a convolutional model. It appears that if convolution is critical for teacher models trained on the original 0/​​​​1 hard targets, it is likely to be critical for student models trained to mimic these teacher models. Adding depth to the student MLPs without adding convolution does not substantially close this “convolutional gap”.

  13. https://arxiv.org/pdf/1603.05691.pdf#page=7

  14. ⁠, Preetum Nakkiran, Behnam Neyshabur, Hanie Sedghi (2020-10-16):

    We propose a new framework for reasoning about generalization in deep learning. The core idea is to couple the Real World, where optimizers take stochastic gradient steps on the empirical loss, to an Ideal World, where optimizers take steps on the population loss. This leads to an alternate decomposition of test error into: (1) the Ideal World test error plus (2) the gap between the two worlds. If the gap (2) is universally small, this reduces the problem of generalization in offline learning to the problem of optimization in online learning. We then give empirical evidence that this gap between worlds can be small in realistic deep learning settings, in particular supervised image classification. For example, CNNs generalize better than MLPs on image distributions in the Real World, but this is “because” they optimize faster on the population loss in the Ideal World. This suggests our framework is a useful tool for understanding generalization in deep learning, and lays a foundation for future research in the area.

  15. ⁠, Jacob Devlin (2017-05-04):

    Attentional sequence-to-sequence models have become the new standard for machine translation, but one challenge of such models is an increase in training and decoding cost compared to phrase-based systems. Here, we focus on efficient decoding, with a goal of achieving accuracy close to the state-of-the-art in neural machine translation (NMT), while achieving CPU decoding speed/throughput close to that of a phrasal decoder.

    We approach this problem from two angles: First, we describe several techniques for speeding up an NMT decoder, which obtain a 4.4× speedup over a very efficient baseline decoder without changing the decoder output. Second, we propose a simple but powerful network architecture which uses an RNN (GRU/LSTM) layer at the bottom, followed by a series of stacked fully-connected layers applied at every timestep. This architecture achieves similar accuracy to a deep recurrent model, at a small fraction of the training and decoding cost. By combining these techniques, our best system achieves a very competitive accuracy of 38.3 on WMT English-French NewsTest2014, while decoding at 100 words/sec on single-threaded CPU. We believe this is the best published accuracy/speed trade-off of an NMT system.
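
    [A minimal sketch of the proposed architecture in PyTorch: a single recurrent (GRU/LSTM) layer at the bottom followed by a stack of fully-connected layers applied at every timestep. It omits attention and all of the decoding-speed techniques; sizes are illustrative.]

    ```python
    import torch
    import torch.nn as nn

    class RNNThenFC(nn.Module):
        def __init__(self, vocab=32000, d=512, n_fc=4):
            super().__init__()
            self.embed = nn.Embedding(vocab, d)
            self.rnn = nn.LSTM(d, d, batch_first=True)   # single recurrent layer at the bottom
            fcs = []
            for _ in range(n_fc):
                fcs += [nn.Linear(d, d), nn.ReLU()]
            self.fc_stack = nn.Sequential(*fcs)          # stacked FC layers, applied at every timestep
            self.out = nn.Linear(d, vocab)

        def forward(self, ids):                          # (batch, time) token ids
            h, _ = self.rnn(self.embed(ids))             # (batch, time, d)
            return self.out(self.fc_stack(h))            # Linear layers broadcast over the time dimension

    logits = RNNThenFC()(torch.randint(0, 32000, (2, 10)))
    ```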

  16. ⁠, David Balduzzi, Marcus Frean, Lennox Leary, JP Lewis, Kurt Wan-Duo Ma, Brian McWilliams (2017-02-28):

    A long-standing obstacle to progress in deep learning is the problem of vanishing and exploding gradients. Although the problem has largely been overcome via carefully constructed initializations and batch normalization, architectures incorporating skip-connections such as highway networks and ResNets perform much better than standard feedforward architectures despite well-chosen initialization and batch normalization. In this paper, we identify the shattered gradients problem. Specifically, we show that the correlation between gradients in standard feedforward networks decays exponentially with depth resulting in gradients that resemble white noise whereas, in contrast, the gradients in architectures with skip-connections are far more resistant to shattering, decaying sublinearly. Detailed empirical evidence is presented in support of the analysis, on both fully-connected networks and convnets. Finally, we present a new “looks linear” (LL) initialization that prevents shattering, with preliminary experiments showing the new initialization allows training very deep networks without the addition of skip-connections.

  17. ⁠, Andrew Brock, Soham De, Samuel L. Smith, Karen Simonyan (2021-02-11):

    Batch normalization is a key component of most image classification models, but it has many undesirable properties stemming from its dependence on the batch size and interactions between examples. Although recent work has succeeded in training deep ResNets without normalization layers, these models do not match the test accuracies of the best batch-normalized networks, and are often unstable for large learning rates or strong data augmentations. In this work, we develop an adaptive gradient clipping technique which overcomes these instabilities, and design a significantly improved class of Normalizer-Free ResNets. Our smaller models match the test accuracy of an EfficientNet-B7 on ImageNet while being up to 8.7× faster to train, and our largest models attain a new state-of-the-art top-1 accuracy of 86.5%. In addition, Normalizer-Free models attain significantly better performance than their batch-normalized counterparts when finetuning on ImageNet after large-scale pre-training on a dataset of 300 million labeled images, with our best models obtaining an accuracy of 89.2%. Our code is available at https://github.com/deepmind/deepmind-research/tree/master/nfnets
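
    [A simplified sketch of adaptive gradient clipping as described above: a gradient is scaled down whenever its norm is large relative to the norm of the parameter it updates. For brevity this version clips per parameter tensor, whereas the paper applies the rule unit-wise; the clipping threshold and epsilon are illustrative.]

    ```python
    import torch

    def adaptive_grad_clip(parameters, clip=0.01, eps=1e-3):
        for p in parameters:
            if p.grad is None:
                continue
            w_norm = p.detach().norm().clamp_min(eps)    # floor the parameter norm for freshly-initialized weights
            g_norm = p.grad.detach().norm()
            max_norm = clip * w_norm
            if g_norm > max_norm:
                p.grad.mul_(max_norm / (g_norm + 1e-6))  # rescale the gradient to the allowed norm

    # usage: after loss.backward() and before optimizer.step()
    layer = torch.nn.Linear(10, 10)
    layer(torch.randn(4, 10)).sum().backward()
    adaptive_grad_clip(layer.parameters())
    ```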

  18. ⁠, A. Emin Orhan, Xaq Pitkow (2017-01-31):

    Skip connections made the training of very deep networks possible and have become an indispensable component in a variety of neural architectures. A completely satisfactory explanation for their success remains elusive. Here, we present a novel explanation for the benefits of skip connections in training very deep networks. The difficulty of training deep networks is partly due to the singularities caused by the non-identifiability of the model. Several such singularities have been identified in previous works: (i) overlap singularities caused by the permutation symmetry of nodes in a given layer, (ii) elimination singularities corresponding to the elimination, i.e. consistent deactivation, of nodes, (iii) singularities generated by the linear dependence of the nodes. These singularities cause degenerate manifolds in the loss landscape that slow down learning. We argue that skip connections eliminate these singularities by breaking the permutation symmetry of nodes, by reducing the possibility of node elimination and by making the nodes less linearly dependent. Moreover, for typical initializations, skip connections move the network away from the “ghosts” of these singularities and sculpt the landscape around them to alleviate the learning slow-down. These hypotheses are supported by evidence from simplified models, as well as from experiments with deep networks trained on real-world datasets.

  19. ⁠, Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel S. Schoenholz, Jeffrey Pennington (2018-06-14):

    In recent years, state-of-the-art methods in computer vision have utilized increasingly deep convolutional neural network architectures (CNNs), with some of the most successful models employing hundreds or even thousands of layers. A variety of pathologies such as vanishing/​​​​exploding gradients make training such deep networks challenging. While residual connections and batch normalization do enable training at these depths, it has remained unclear whether such specialized architecture designs are truly necessary to train deep CNNs. In this work, we demonstrate that it is possible to train vanilla CNNs with ten thousand layers or more simply by using an appropriate initialization scheme. We derive this initialization scheme theoretically by developing a mean field theory for signal propagation and by characterizing the conditions for dynamical isometry, the equilibration of singular values of the input-output Jacobian matrix. These conditions require that the convolution operator be an orthogonal transformation in the sense that it is norm-preserving. We present an algorithm for generating such random initial orthogonal convolution kernels and demonstrate empirically that they enable efficient training of extremely deep architectures.

  20. ⁠, Marco Ciccone, Marco Gallieri, Jonathan Masci, Christian Osendorfer, Faustino Gomez (2018-04-19):

    This paper introduces Non-Autonomous Input-Output Stable Network (NAIS-Net), a very deep architecture where each stacked processing block is derived from a time-invariant non-autonomous dynamical system.

    Non-autonomy is implemented by skip connections from the block input to each of the unrolled processing stages and allows stability to be enforced so that blocks can be unrolled adaptively to a pattern-dependent processing depth. NAIS-Net induces non-trivial, input-output maps, even for an infinite unroll length.

    We prove that the network is globally asymptotically stable so that for every initial condition there is exactly one input-dependent equilibrium assuming tanh units, and incrementally stable for ReLU units. An efficient implementation that enforces the stability under derived conditions for both fully-connected and convolutional layers is also presented. Experimental results show how NAIS-Net exhibits stability in practice, yielding a substantial reduction in generalization gap compared to ResNets.

  21. ⁠, Yuehaw Khoo, Lexing Ying (2018-10-23):

    We propose a novel neural network architecture, SwitchNet, for solving the wave equation based inverse scattering problems via providing maps between the scatterers and the scattered field (and vice versa). The main difficulty of using a neural network for this problem is that a scatterer has a global impact on the scattered wave field, rendering typical convolutional neural network with local connections inapplicable. While it is possible to deal with such a problem using a fully connected network, the number of parameters grows quadratically with the size of the input and output data. By leveraging the inherent low-rank structure of the scattering problems and introducing a novel switching layer with sparse connections, the SwitchNet architecture uses much fewer parameters and facilitates the training process. Numerical experiments show promising accuracy in learning the forward and inverse maps between the scatterers and the scattered wave field.

  22. ⁠, Decebal Constantin Mocanu, Elena Mocanu, Peter Stone, Phuong H. Nguyen, Madeleine Gibescu, Antonio Liotta (2018-06-19):

    Through the success of deep learning in various domains, artificial neural networks are currently among the most used artificial intelligence methods. Taking inspiration from the network properties of biological neural networks (eg. sparsity, scale-freeness), we argue that (contrary to general practice) artificial neural networks, too, should not have fully-connected layers.

    Here we propose sparse evolutionary training of artificial neural networks, an algorithm which evolves an initial sparse topology (Erdős-Rényi random graph) of 2 consecutive layers of neurons into a scale-free topology, during learning. Our method replaces artificial neural networks’ fully-connected layers with sparse ones before training, reducing quadratically the number of parameters, with no decrease in accuracy. We demonstrate our claims on restricted Boltzmann machines, multi-layer perceptrons, and convolutional neural networks for unsupervised and supervised learning on 15 datasets.

    Our approach has the potential to enable artificial neural networks to scale up beyond what is currently possible.
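
    [A simplified single-layer sketch of the sparse evolutionary training loop described above: start from an Erdős–Rényi sparse mask, and after each epoch prune a fraction of the smallest-magnitude connections and regrow the same number at random positions. Fractions and sizes are illustrative.]

    ```python
    import torch

    def init_mask(shape, density=0.1):
        # sparse Erdos-Renyi-style random topology: each connection exists with probability `density`
        return (torch.rand(shape) < density).float()

    def evolve_mask(weight, mask, zeta=0.3):
        active = mask.bool()
        k = int(zeta * active.sum())                            # number of connections to replace
        mags = weight.abs().masked_fill(~active, float("inf"))
        drop = torch.topk(mags.flatten(), k, largest=False).indices
        mask.view(-1)[drop] = 0.0                               # prune the weakest active connections
        grow = torch.multinomial((1 - mask).flatten(), k)       # regrow at random currently-empty positions
        mask.view(-1)[grow] = 1.0
        return mask

    w = torch.randn(256, 784)
    m = init_mask(w.shape)
    m = evolve_mask(w, m)      # call once per epoch; train with the masked weights w * m in between
    ```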

  23. ⁠, Stéphane d'Ascoli, Levent Sagun, Joan Bruna, Giulio Biroli (2019-06-16):

    Despite the phenomenal success of deep neural networks in a broad range of learning tasks, there is a lack of theory to understand the way they work. In particular, Convolutional Neural Networks (CNNs) are known to perform much better than Fully-Connected Networks (FCNs) on spatially structured data: the architectural structure of CNNs benefits from knowledge on the features of the data, for instance their translation invariance. The aim of this work is to understand this fact through the lens of dynamics in the loss landscape.

    We introduce a method that maps a CNN to its equivalent FCN (denoted as eFCN). Such an embedding enables the comparison of CNN and FCN training dynamics directly in the FCN space. We use this method to test a new training protocol, which consists in training a CNN, embedding it to FCN space at a certain “relax time”, then resuming the training in FCN space. We observe that for all relax times, the deviation from the CNN subspace is small, and the final performance reached by the eFCN is higher than that reachable by a standard FCN of same architecture. More surprisingly, for some intermediate relax times, the eFCN outperforms the CNN it stemmed from, by combining the prior information of the CNN and the expressivity of the FCN in a complementary way. The practical interest of our protocol is limited by the very large size of the highly sparse eFCN. However, it offers interesting insights into the persistence of architectural bias under stochastic gradient dynamics. It shows the existence of some rare basins in the FCN loss landscape associated with very good generalization. These can only be accessed thanks to the CNN prior, which helps navigate the landscape during the early stages of optimization.

  24. ⁠, Thomas Bachlechner, Bodhisattwa Prasad Majumder, Huanru Henry Mao, Garrison W. Cottrell, Julian McAuley (2020-03-10):

    Deep networks often suffer from vanishing or exploding gradients due to inefficient signal propagation, leading to long training times or convergence difficulties. Various architecture designs, sophisticated residual-style networks, and initialization schemes have been shown to improve deep signal propagation. Recently, Pennington et al used free probability theory to show that dynamical isometry plays an integral role in efficient deep learning. We show that the simplest architecture change of gating each residual connection using a single zero-initialized parameter satisfies initial dynamical isometry and outperforms more complex approaches. Although much simpler than its predecessors, this gate enables training thousands of fully connected layers with fast convergence and better test performance for ResNets trained on CIFAR-10. We apply this technique to language modeling and find that we can easily train 120-layer Transformers. When applied to 12 layer Transformers, it converges 56% faster on enwiki8.
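
    [A minimal sketch of the gating idea in PyTorch: each residual branch is multiplied by a single scalar initialized at zero, so every block starts as the identity map and the network learns how much of each branch to admit. Illustrative, not the authors’ implementation.]

    ```python
    import torch
    import torch.nn as nn

    class ReZeroBlock(nn.Module):
        def __init__(self, d=256):
            super().__init__()
            self.branch = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
            self.alpha = nn.Parameter(torch.zeros(1))    # the single zero-initialized gate

        def forward(self, x):
            return x + self.alpha * self.branch(x)       # exact identity at initialization

    deep = nn.Sequential(*[ReZeroBlock() for _ in range(100)])   # very deep stack, still well-behaved at init
    y = deep(torch.randn(8, 256))
    ```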

  25. ⁠, Behnam Neyshabur (2020-07-27):

    Convolution is one of the most essential components of architectures used in computer vision. As machine learning moves towards reducing the expert bias and learning it from data, a natural next step seems to be learning convolution-like structures from scratch. This, however, has proven elusive. For example, current state-of-the-art architecture search algorithms use convolution as one of the existing modules rather than learning it from data.

    In an attempt to understand the inductive bias that gives rise to convolutions, we investigate minimum description length as a guiding principle and show that in some settings, it can indeed be indicative of the performance of architectures.

    To find architectures with small description length, we propose β-LASSO, a simple variant of the LASSO algorithm that, when applied on fully-connected networks for image classification tasks, learns architectures with local connections and achieves state-of-the-art accuracies for training fully-connected nets on CIFAR-10 (85.19%), CIFAR-100 (59.56%) and SVHN (94.07%), bridging the gap between fully-connected and convolutional nets.

  26. ⁠, Franco Pellegrini, Giulio Biroli (2021-04-27):

    Pruning methods can considerably reduce the size of artificial neural networks without harming their performance. In some cases, they can even uncover sub-networks that, when trained in isolation, match or surpass the test accuracy of their dense counterparts. Here we study the inductive bias that pruning imprints in such “winning lottery tickets”. Focusing on visual tasks, we analyze the architecture resulting from iterative magnitude pruning of a simple fully connected network (FCN). We show that the surviving node connectivity is local in input space, and organized in patterns reminiscent of the ones found in convolutional networks (CNN). We investigate the role played by data and tasks in shaping the architecture of pruned sub-networks. Our results show that the winning lottery tickets of FCNs display the key features of CNNs. The ability of such automatic network-simplifying procedure to recover the key features “hand-crafted” in the design of CNNs suggests interesting applications to other datasets and tasks, in order to discover new and efficient architectural inductive biases.

  27. ⁠, Roman Pogodin, Yash Mehta, Timothy P. Lillicrap, Peter E. Latham (2021-06-22):

    Convolutional networks are ubiquitous in deep learning. They are particularly useful for images, as they reduce the number of parameters, reduce training time, and increase accuracy. However, as a model of the brain they are seriously problematic, since they require weight sharing—something real neurons simply cannot do. Consequently, while neurons in the brain can be locally connected (one of the features of convolutional networks), they cannot be convolutional. Locally connected but non-convolutional networks, however, significantly underperform convolutional ones. This is troublesome for studies that use convolutional networks to explain activity in the visual system. Here we study plausible alternatives to weight sharing that aim at the same regularization principle, which is to make each neuron within a pool react similarly to identical inputs. The most natural way to do that is by showing the network multiple translations of the same image, akin to saccades in animal vision. However, this approach requires many translations, and doesn’t remove the performance gap. We propose instead to add lateral connectivity to a locally connected network, and allow learning via Hebbian plasticity. This requires the network to pause occasionally for a sleep-like phase of “weight sharing”. This method enables locally connected networks to achieve nearly convolutional performance on ImageNet, thus supporting convolutional networks as a model of the visual stream.

  28. ⁠, John D. Martin, Joseph Modayil (2021-06-17):

    The performance of a reinforcement learning (RL) system depends on the computational architecture used to approximate a value function. Deep learning methods provide both optimization techniques and architectures for approximating nonlinear functions from noisy, high-dimensional observations. However, prevailing optimization techniques are not designed for strictly-incremental online updates. Nor are standard architectures designed for observations with an a priori unknown structure: for example, light sensors randomly dispersed in space. This paper proposes an online RL prediction algorithm with an adaptive architecture that efficiently finds useful nonlinear features. The algorithm is evaluated in a spatial domain with high-dimensional, stochastic observations. The algorithm outperforms non-adaptive baseline architectures and approaches the performance of an architecture given side-channel information. These results are a step towards scalable RL algorithms for more general problems, where the observation structure is not available.

  29. ⁠, Taras Kucherenko, Patrik Jonell, Sanne van Waveren, Gustav Eje Henter, Simon Alexanderson, Iolanda Leite, Hedvig Kjellström (2020-01-25):

    During speech, people spontaneously gesticulate, which plays a key role in conveying information. Similarly, realistic co-speech gestures are crucial to enable natural and smooth interactions with social agents. Current end-to-end co-speech gesture generation systems use a single modality for representing speech: either audio or text. These systems are therefore confined to producing either acoustically-linked beat gestures or semantically-linked gesticulation (e.g., raising a hand when saying “high”): they cannot appropriately learn to generate both gesture types. We present a model designed to produce arbitrary beat and semantic gestures together. Our deep-learning based model takes both acoustic and semantic representations of speech as input, and generates gestures as a sequence of joint angle rotations as output. The resulting gestures can be applied to both virtual agents and humanoid robots. Subjective and objective evaluations confirm the success of our approach. The code and video are available at the project page https:/​​​​/​​​​svito-zar.github.io/​​​​gesticulator .

  30. ⁠, Xiaohan Ding, Xiangyu Zhang, Jungong Han, Guiguang Ding (2021-05-05):

    We propose RepMLP, a multi-layer-perceptron-style neural network building block for image recognition, which is composed of a series of fully-connected (FC) layers. Compared to convolutional layers, FC layers are more efficient, better at modeling the long-range dependencies and positional patterns, but worse at capturing the local structures, hence usually less favored for image recognition. We propose a structural re-parameterization technique that adds local prior into an FC to make it powerful for image recognition. Specifically, we construct convolutional layers inside a RepMLP during training and merge them into the FC for inference. On CIFAR, a simple pure-MLP model shows performance very close to CNN. By inserting RepMLP in traditional CNN, we improve ResNets by 1.8% accuracy on ImageNet, 2.9% for face recognition, and 2.3% mIoU on Cityscapes with lower FLOPs. Our intriguing findings highlight that combining the global representational capacity and positional perception of FC with the local prior of convolution can improve the performance of a neural network with faster speed on both the tasks with translation invariance (e.g., semantic segmentation) and those with aligned images and positional patterns (e.g., face recognition). The code and models are available at https://github.com/DingXiaoH/RepMLP.

  31. ⁠, Zizheng Pan, Bohan Zhuang, Haoyu He, Jing Liu, Jianfei Cai (2021-05-29):

    Transformers have become one of the dominant architectures in deep learning, particularly as a powerful alternative to convolutional neural networks (CNNs) in computer vision. However, Transformer training and inference in previous works can be prohibitively expensive due to the quadratic complexity of self-attention over a long sequence of representations, especially for high-resolution dense prediction tasks. To this end, we present a novel Less attention vIsion Transformer (LIT), building upon the fact that convolutions, fully-connected (FC) layers, and self-attentions have almost equivalent mathematical expressions for processing image patch sequences. Specifically, we propose a hierarchical Transformer where we use pure multi-layer perceptrons (MLPs) to encode rich local patterns in the early stages while applying self-attention modules to capture longer dependencies in deeper layers. Moreover, we further propose a learned deformable token merging module to adaptively fuse informative patches in a non-uniform manner. The proposed LIT achieves promising performance on image recognition tasks, including image classification and instance segmentation, serving as a strong backbone for many vision tasks. Code is available at: https://github.com/MonashAI/LIT

  32. ⁠, Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo (2021-05-31):

    We present SegFormer, a simple, efficient yet powerful semantic segmentation framework which unifies Transformers with lightweight multilayer perceptron (MLP) decoders. SegFormer has two appealing features: 1) SegFormer comprises a novel hierarchically structured Transformer encoder which outputs multiscale features. It does not need positional encoding, thereby avoiding the interpolation of positional codes which leads to decreased performance when the testing resolution differs from training. 2) SegFormer avoids complex decoders. The proposed MLP decoder aggregates information from different layers, thus combining both local attention and global attention to render powerful representations. We show that this simple and lightweight design is the key to efficient segmentation on Transformers. We scale our approach up to obtain a series of models from SegFormer-B0 to SegFormer-B5, reaching significantly better performance and efficiency than previous counterparts. For example, SegFormer-B4 achieves 50.3% mIoU on ADE20K with 64M parameters, being 5× smaller and 2.2% better than the previous best method. Our best model, SegFormer-B5, achieves 84.0% mIoU on Cityscapes validation set and shows excellent zero-shot robustness on Cityscapes-C. Code will be released at: github.com/NVlabs/SegFormer.

  33. ⁠, Arlind Kadra, Marius Lindauer, Frank Hutter, Josif Grabocka (2021-06-21):

    Tabular datasets are the last “unconquered castle” for deep learning, with traditional ML methods like Gradient-Boosted Decision Trees still performing strongly even against recent specialized neural architectures. In this paper, we hypothesize that the key to boosting the performance of neural networks lies in rethinking the joint and simultaneous application of a large set of modern regularization techniques. As a result, we propose regularizing plain Multilayer Perceptron (MLP) networks by searching for the optimal combination/​​​​cocktail of 13 regularization techniques for each dataset using a joint optimization over the decision on which regularizers to apply and their subsidiary hyperparameters. We empirically assess the impact of these regularization cocktails for MLPs on a large-scale empirical study comprising 40 tabular datasets and demonstrate that (i) well-regularized plain MLPs significantly outperform recent state-of-the-art specialized neural network architectures, and (ii) they even outperform strong traditional ML methods, such as XGBoost.

  34. ⁠, Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, Alexey Dosovitskiy (2021-05-04):

    [blog] Convolutional Neural Networks (CNNs) are the go-to model for computer vision. Recently, attention-based networks, such as the Vision Transformer, have also become popular. In this paper we show that while convolutions and attention are both sufficient for good performance, neither of them are necessary.

    We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs). MLP-Mixer contains two types of layers: one with MLPs applied independently to image patches (ie. “mixing” the per-location features), and one with MLPs applied across patches (ie. “mixing” spatial information). When trained on large datasets, or with modern regularization schemes, MLP-Mixer attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models.

    We hope that these results spark further research beyond the realms of well established CNNs and Transformers.
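
    [A sketch of a single Mixer block in PyTorch: one MLP mixes information across patches (applied along the transposed token dimension) and one mixes across channels, each wrapped in a residual connection. LayerNorm placement and sizes follow the usual description but are illustrative here.]

    ```python
    import torch
    import torch.nn as nn

    def mlp(d_in, d_hidden):
        return nn.Sequential(nn.Linear(d_in, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_in))

    class MixerBlock(nn.Module):
        def __init__(self, n_patches=196, d_channels=512):
            super().__init__()
            self.norm1, self.norm2 = nn.LayerNorm(d_channels), nn.LayerNorm(d_channels)
            self.token_mix = mlp(n_patches, 256)       # mixes spatial information across patches
            self.channel_mix = mlp(d_channels, 2048)   # mixes per-location features across channels

        def forward(self, x):                          # (batch, patches, channels)
            y = self.token_mix(self.norm1(x).transpose(1, 2)).transpose(1, 2)
            x = x + y
            return x + self.channel_mix(self.norm2(x))

    out = MixerBlock()(torch.randn(2, 196, 512))
    ```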

  35. ⁠, Xiangning Chen, Cho-Jui Hsieh, Boqing Gong (2021-06-03):

    Vision Transformers (ViTs) and MLPs signal further efforts on replacing hand-wired features or inductive biases with general-purpose neural architectures. Existing works empower the models by massive data, such as large-scale pretraining and/​​​​or repeated strong data augmentations, and still report optimization-related problems (e.g., sensitivity to initialization and learning rate). Hence, this paper investigates ViTs and MLP-Mixers from the lens of loss geometry, intending to improve the models’ data efficiency at training and generalization at inference. Visualization and Hessian reveal extremely sharp local minima of converged models. By promoting smoothness with a recently proposed sharpness-aware optimizer, we substantially improve the accuracy and robustness of ViTs and MLP-Mixers on various tasks spanning supervised, adversarial, ⁠, and transfer learning (e.g., +5.3% and +11.0% top-1 accuracy on ImageNet for ViT-B/​​​​16 and Mixer-B/​​​​16, respectively, with the simple Inception-style preprocessing). We show that the improved smoothness attributes to sparser active neurons in the first few layers. The resultant ViTs outperform ResNets of similar size and throughput when trained from scratch on ImageNet without large-scale pretraining or strong data augmentations. They also possess more perceptive attention maps.

  36. ⁠, Jaesung Tae, Hyeongju Kim, Younggun Lee (2021-06-15):

    Recent developments in deep learning have significantly improved the quality of synthesized singing voice audio. However, prominent neural singing voice synthesis systems suffer from slow inference speed due to their autoregressive design. Inspired by MLP-Mixer, a novel architecture introduced in the vision literature for attention-free image classification, we propose MLP Singer, a parallel Korean singing voice synthesis system. To the best of our knowledge, this is the first work that uses an entirely MLP-based architecture for voice synthesis. Listening tests demonstrate that MLP Singer outperforms a larger autoregressive -based system, both in terms of audio quality and synthesis speed. In particular, MLP Singer achieves a real-time factor of up to 200 and 3400 on CPUs and GPUs respectively, enabling order-of-magnitude faster generation in both environments.

  37. ⁠, Tan Yu, Xu Li, Yunfeng Cai, Mingming Sun, Ping Li (2021-06-14):

    Recently, visual Transformer (ViT) and its following works abandon the convolution and exploit the self-attention operation, attaining a comparable or even higher accuracy than CNN. More recently, MLP-Mixer abandons both the convolution and the self-attention operation, proposing an architecture containing only MLP layers. To achieve cross-patch communications, it devises an additional token-mixing MLP besides the channel-mixing MLP. It achieves promising results when training on an extremely large-scale dataset. But it cannot achieve as outstanding performance as its CNN and ViT counterparts when training on medium-scale datasets such as ImageNet1K and ImageNet21K. The performance drop of MLP-Mixer motivates us to rethink the token-mixing MLP.

    We discover that the token-mixing operation in MLP-Mixer is a variant of depthwise convolution with a global receptive field and spatial-specific configuration. But the global receptive field and the spatial-specific property make the token-mixing MLP prone to over-fitting. In this paper, we propose a novel pure MLP architecture, spatial-shift MLP (S2-MLP). Different from MLP-Mixer, our S2-MLP only contains channel-mixing MLP. We devise a spatial-shift operation for achieving the communication between patches. It has a local receptive field and is spatial-agnostic. Meanwhile, it is parameter-free and efficient for computation.

    The proposed S2-MLP attains higher recognition accuracy than MLP-Mixer when training on ImageNet-1K dataset. Meanwhile, S2-MLP accomplishes as excellent performance as ViT on ImageNet-1K dataset with considerably simpler architecture and fewer FLOPs and parameters.
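
    [A sketch of the spatial-shift operation described above: split the channels into 4 groups and shift each group by one patch in a different direction, so that a subsequent channel-mixing MLP sees its neighbors’ features. The operation is parameter-free and local, as the abstract notes; the grouping below is illustrative.]

    ```python
    import torch

    def spatial_shift(x):                # x: (batch, height, width, channels), channels divisible by 4
        b, h, w, c = x.shape
        g = c // 4
        out = x.clone()
        out[:, 1:, :, 0*g:1*g] = x[:, :-1, :, 0*g:1*g]   # shift one group of channels down
        out[:, :-1, :, 1*g:2*g] = x[:, 1:, :, 1*g:2*g]   # shift one group up
        out[:, :, 1:, 2*g:3*g] = x[:, :, :-1, 2*g:3*g]   # shift one group right
        out[:, :, :-1, 3*g:4*g] = x[:, :, 1:, 3*g:4*g]   # shift one group left
        return out

    shifted = spatial_shift(torch.randn(2, 14, 14, 256))
    ```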

  38. ⁠, Tan Yu, Xu Li, Yunfeng Cai, Mingming Sun, Ping Li (2021-08-02):

    Recently, MLP-based vision backbones have emerged. MLP-based vision architectures with less inductive bias achieve competitive performance in image recognition compared with CNNs and vision Transformers. Among them, spatial-shift MLP (S2-MLP), adopting the straightforward spatial-shift operation, achieves better performance than the pioneering works including MLP-Mixer and ResMLP. More recently, using smaller patches with a pyramid structure, Vision Permutator (ViP) and the Global Filter Network (GFNet) achieve better performance than S2-MLP.

    In this paper, we improve the S2-MLP vision backbone. We expand the feature map along the channel dimension and split the expanded feature map into several parts. We conduct different spatial-shift operations on split parts.

    Meanwhile, we exploit the split-attention operation to fuse these split parts. Moreover, like the counterparts, we adopt smaller-scale patches and use a pyramid structure for boosting the image recognition accuracy. We term the improved spatial-shift MLP vision backbone as S2-MLPv2. Using 55M parameters, our medium-scale model, S2-MLPv2-Medium achieves an 83.6% top-1 accuracy on the ImageNet-1K benchmark using 224×224px images without self-attention and external training data.

  39. https://raw.githubusercontent.com/lukemelas/do-you-even-need-attention/main/Do-You-Even-Need-Attention.pdf

  40. ⁠, Hugo Touvron, Piotr Bojanowski, Mathilde Caron, Matthieu Cord, Alaaeldin El-Nouby, Edouard Grave, Gautier Izacard, Armand Joulin, Gabriel Synnaeve, Jakob Verbeek, Hervé Jégou (2021-05-07):

    We present ResMLP, an architecture built entirely upon multi-layer perceptrons for image classification. It is a simple residual network that alternates (i) a linear layer in which image patches interact, independently and identically across channels, and (ii) a two-layer feed-forward network in which channels interact independently per patch. When trained with a modern training strategy using heavy data-augmentation and optionally distillation, it attains surprisingly good accuracy/​​​​complexity trade-offs on ImageNet. We also train ResMLP models in a self-supervised setup, to further remove priors from employing a labelled dataset. Finally, by adapting our model to machine translation we achieve surprisingly good results.

    We share pre-trained models and our code based on the Timm library.

  41. ⁠, Hanxiao Liu, Zihang Dai, David R. So, Quoc V. Le (2021-05-17):

    Transformers have become one of the most important architectural innovations in deep learning and have enabled many breakthroughs over the past few years. Here we propose a simple network architecture, gMLP, based on MLPs with gating, and show that it can perform as well as Transformers in key language and vision applications. Our comparisons show that self-attention is not critical for Vision Transformers, as gMLP can achieve the same accuracy. For BERT, our model achieves parity with Transformers on pretraining perplexity and is better on some downstream NLP tasks. On finetuning tasks where gMLP performs worse, making the gMLP model substantially larger can close the gap with Transformers. In general, our experiments show that gMLP can scale as well as Transformers over increased data and compute.
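
    [A sketch of a gMLP block in PyTorch: channel projections plus a “spatial gating unit” that splits the channels in half and gates one half with a linear projection of the other taken across the token dimension. The near-identity initialization of the gating projection and all sizes are assumptions for illustration.]

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SpatialGatingUnit(nn.Module):
        def __init__(self, d_ffn, seq_len):
            super().__init__()
            self.norm = nn.LayerNorm(d_ffn // 2)
            self.proj = nn.Linear(seq_len, seq_len)        # acts across tokens, not channels
            nn.init.normal_(self.proj.weight, std=1e-3)    # start close to the identity gate
            nn.init.ones_(self.proj.bias)

        def forward(self, x):                              # (batch, seq, d_ffn)
            u, v = x.chunk(2, dim=-1)
            v = self.proj(self.norm(v).transpose(1, 2)).transpose(1, 2)
            return u * v                                   # (batch, seq, d_ffn // 2)

    class GMLPBlock(nn.Module):
        def __init__(self, d_model=256, d_ffn=1024, seq_len=128):
            super().__init__()
            self.norm = nn.LayerNorm(d_model)
            self.fc_in = nn.Linear(d_model, d_ffn)
            self.sgu = SpatialGatingUnit(d_ffn, seq_len)
            self.fc_out = nn.Linear(d_ffn // 2, d_model)

        def forward(self, x):
            return x + self.fc_out(self.sgu(F.gelu(self.fc_in(self.norm(x)))))

    y = GMLPBlock()(torch.randn(2, 128, 256))
    ```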

  42. ⁠, Qibin Hou, Zihang Jiang, Li Yuan, Ming-Ming Cheng, Shuicheng Yan, Jiashi Feng (2021-06-23):

    In this paper, we present Vision Permutator, a conceptually simple and data efficient MLP-like architecture for visual recognition. By realizing the importance of the positional information carried by 2D feature representations, unlike recent MLP-like models that encode the spatial information along the flattened spatial dimensions, Vision Permutator separately encodes the feature representations along the height and width dimensions with linear projections. This allows Vision Permutator to capture long-range dependencies along one spatial direction and meanwhile preserve precise positional information along the other direction. The resulting position-sensitive outputs are then aggregated in a mutually complementing manner to form expressive representations of the objects of interest. We show that our Vision Permutators are formidable competitors to convolutional neural networks (CNNs) and vision transformers. Without the dependence on spatial convolutions or attention mechanisms, Vision Permutator achieves 81.5% top-1 accuracy on ImageNet without extra large-scale training data (e.g., ImageNet-22k) using only 25M learnable parameters, which is much better than most CNNs and vision transformers under the same model size constraint. When scaling up to 88M, it attains 83.2% top-1 accuracy. We hope this work could encourage research on rethinking the way of encoding spatial information and facilitate the development of MLP-like models. Code is available at https:/​​​​/​​​​github.com/​​​​Andrew-Qibin/​​​​VisionPermutator.

  43. ⁠, Peng Gao, Jiasen Lu, Hongsheng Li, Roozbeh Mottaghi, Aniruddha Kembhavi (2021-06-02):

    Convolutional neural networks (CNNs) are ubiquitous in computer vision, with a myriad of effective and efficient variations. Recently, Transformers—originally introduced in natural language processing—have been increasingly adopted in computer vision. While early adopters continue to employ CNN backbones, the latest networks are end-to-end CNN-free Transformer solutions. A recent surprising finding shows that a simple MLP-based solution without any traditional convolutional or Transformer components can produce effective visual representations.

    While CNNs, Transformers and MLP-Mixers may be considered as completely disparate architectures, we provide a unified view showing that they are in fact special cases of a more general method to aggregate spatial context in a neural network stack. We present the CONTAINER (CONText AggregatIon NEtwoRk), a general-purpose building block for multi-head context aggregation that can exploit long-range interactions a la Transformers while still exploiting the inductive bias of the local convolution operation leading to faster convergence speeds, often seen in CNNs.

    In contrast to Transformer-based methods that do not scale well to downstream tasks that rely on larger input image resolutions, our efficient network, named CONTAINER-LIGHT, can be employed in object detection and instance segmentation networks such as ⁠, and to obtain an impressive detection mAP of 38.9, 43.8, 45.1 and mask mAP of 41.3, providing large improvements of 6.6, 7.3, 6.9 and 6.6 points respectively, compared to a ResNet-50 backbone with a comparable compute and parameter size.

    Our method also achieves promising results on self-supervised learning compared to on the framework.

  44. ⁠, Shoufa Chen, Enze Xie, Chongjian Ge, Ding Liang, Ping Luo (2021-07-21):

    This paper presents a simple MLP-like architecture, CycleMLP, which is a versatile backbone for visual recognition and dense predictions, unlike modern MLP architectures, e.g., MLP-Mixer, ResMLP, and gMLP, whose architectures are correlated to image size and thus are infeasible in object detection and segmentation. CycleMLP has two advantages compared to modern approaches. (1) It can cope with various image sizes. (2) It achieves linear computational complexity to image size by using local windows. In contrast, previous MLPs have quadratic computations because of their fully spatial connections. We build a family of models that surpass existing MLPs and achieve a comparable accuracy (83.2%) on ImageNet-1K classification compared to the state-of-the-art Transformer such as Swin Transformer (83.3%) but using fewer parameters and FLOPs. We expand the MLP-like models’ applicability, making them a versatile backbone for dense prediction tasks. CycleMLP aims to provide a competitive baseline on object detection, instance segmentation, and semantic segmentation for MLP models. In particular, CycleMLP achieves 45.1 mIoU on ADE20K val, comparable to Swin (45.2 mIoU). Code is available at https://github.com/ShoufaChen/CycleMLP.

  45. ⁠, Yuki Tatsunami, Masato Taki (2021-08-09):

    For the past ten years, CNN has reigned supreme in the world of computer vision, but recently, the Transformer is on the rise. However, the quadratic computational cost of self-attention has become a severe problem in practice. There has been much research on architectures without CNN and self-attention in this context. In particular, MLP-Mixer is a simple idea designed using MLPs and hits an accuracy comparable to the Vision Transformer. However, the only inductive bias in this architecture is the embedding of tokens. Thus, there is still a possibility to build a non-convolutional inductive bias into the architecture itself, and we built in an inductive bias using two simple ideas. One way is to divide the token-mixing block vertically and horizontally. Another way is to make spatial correlations denser among some channels of token-mixing. With this approach, we were able to improve the accuracy of the MLP-Mixer while reducing its parameters and computational complexity. Compared to other MLP-based models, the proposed model, named RaftMLP, has a good balance of computational complexity, the number of parameters, and actual memory usage. In addition, our work indicates that MLP-based models have the potential to replace CNNs by adopting inductive bias. The source code in PyTorch version is available at https://github.com/okojoalg/raft-mlp.

  46. ⁠, Dongze Lian, Zehao Yu, Xing Sun, Shenghua Gao (2021-07-18):

    An Axial Shifted MLP architecture (AS-MLP) is proposed in this paper. Different from MLP-Mixer, where the global spatial feature is encoded for the information flow through matrix transposition and one token-mixing MLP, we pay more attention to local feature communication. By axially shifting channels of the feature map, AS-MLP is able to obtain information flow from different axial directions, which captures the local dependencies. Such an operation enables us to utilize a pure MLP architecture to achieve the same local receptive field as a CNN-like architecture. We can also design the receptive field size and dilation of AS-MLP blocks, just like designing those of convolution kernels. With the proposed AS-MLP architecture, our model obtains 83.3% Top-1 accuracy with 88M parameters and 15.2 GFLOPs on the ImageNet-1K dataset. Such a simple yet effective architecture outperforms all MLP-based architectures and achieves competitive performance compared to transformer-based architectures (e.g., Swin Transformer), even with slightly lower FLOPs. In addition, AS-MLP is also the first MLP-based architecture to be applied to downstream tasks (e.g., object detection and semantic segmentation). The experimental results are also impressive: our proposed AS-MLP obtains 51.5 mAP on the COCO validation set and 49.5 MS mIoU on the ADE20K dataset, which is competitive with transformer-based architectures. Code is available at https://github.com/svip-lab/AS-MLP.
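
    A rough PyTorch sketch of the axial-shift idea: channel groups are shifted by different offsets along one spatial axis (with zero padding rather than wrap-around), after which a plain 1×1 channel projection mixes features from neighbouring rows or columns. The group split, shift size of 3, and block layout are illustrative assumptions, not the paper's code:

    ```python
    import torch
    import torch.nn as nn

    def axial_shift(x, shift_size=3, axis=2):
        """Shift equal channel groups of x (B, C, H, W) by -1, 0, +1, ... positions
        along one spatial axis, padding with zeros, so that a following 1x1 channel
        projection can mix features from neighbouring rows or columns."""
        shifts = list(range(-(shift_size // 2), shift_size // 2 + 1))   # e.g. [-1, 0, 1]
        shifted = []
        for g, s in zip(x.chunk(len(shifts), dim=1), shifts):
            g = torch.roll(g, shifts=s, dims=axis)
            if s > 0:                                  # zero the wrapped-around border
                g.narrow(axis, 0, s).zero_()
            elif s < 0:
                g.narrow(axis, g.size(axis) + s, -s).zero_()
            shifted.append(g)
        return torch.cat(shifted, dim=1)

    class AxialShiftMLP(nn.Module):
        """Minimal AS-MLP-style token mixer: shift along H and along W, project each
        with a 1x1 convolution (a per-pixel channel MLP), and sum the two paths."""
        def __init__(self, dim):
            super().__init__()
            self.proj_h = nn.Conv2d(dim, dim, kernel_size=1)
            self.proj_w = nn.Conv2d(dim, dim, kernel_size=1)

        def forward(self, x):                          # x: (B, C, H, W)
            return self.proj_h(axial_shift(x, axis=2)) + self.proj_w(axial_shift(x, axis=3))

    x = torch.randn(2, 64, 14, 14)
    print(AxialShiftMLP(64)(x).shape)                  # torch.Size([2, 64, 14, 14])
    ```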

  47. https://arxiv.org/abs/2108.13341#huawei

  48. ⁠, Yuxuan Lou, Fuzhao Xue, Zangwei Zheng, Yang You (2021-09-05):

    Mixture of Experts (MoE) with sparse conditional computation has proven to be an effective architecture for scaling attention-based models to more parameters with comparable computation cost. In this paper, we propose Sparse-MLP, scaling the recent MLP-Mixer model with sparse MoE layers, to achieve a more computation-efficient architecture. We replace a subset of dense MLP blocks in the MLP-Mixer model with Sparse blocks. In each Sparse block, we apply two stages of MoE layers: one with MLP experts mixing information within channels along the image-patch dimension, and one with MLP experts mixing information within patches along the channel dimension. Besides, to reduce computational cost in routing and improve expert capacity, we design Re-represent layers in each Sparse block. These layers re-scale image representations by two simple but effective linear transformations. By pre-training on ImageNet-1k with the MoCo v3 algorithm, our models can outperform dense MLP models with comparable parameters and less computational cost on several downstream image classification tasks.
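
    A minimal PyTorch sketch of token-level top-1 mixture-of-experts routing of the kind described above; the router, expert shapes, and the omission of capacity limits and load-balancing losses are simplifying assumptions, not the paper's implementation:

    ```python
    import torch
    import torch.nn as nn

    class MoEMLP(nn.Module):
        """Minimal top-1 mixture-of-experts MLP: a router picks one expert MLP per
        token, so parameters grow with the number of experts while per-token compute
        stays roughly constant (capacity limits and balancing losses omitted)."""
        def __init__(self, dim, hidden, num_experts=4):
            super().__init__()
            self.router = nn.Linear(dim, num_experts)
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
                for _ in range(num_experts)
            )

        def forward(self, x):                        # x: (batch, tokens, dim)
            flat = x.reshape(-1, x.shape[-1])        # route each token independently
            gates = torch.softmax(self.router(flat), dim=-1)
            top_gate, top_idx = gates.max(dim=-1)    # top-1 routing
            out = torch.zeros_like(flat)
            for i, expert in enumerate(self.experts):
                mask = top_idx == i
                if mask.any():
                    out[mask] = top_gate[mask].unsqueeze(1) * expert(flat[mask])
            return out.reshape_as(x)

    x = torch.randn(2, 16, 32)
    print(MoEMLP(32, 64)(x).shape)                   # torch.Size([2, 16, 32])
    ```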

  49. ⁠, Jiachen Li, Ali Hassani, Steven Walton, Humphrey Shi (2021-09-09):

    MLP-based architectures, which consist of a sequence of consecutive multi-layer perceptron blocks, have recently been found to reach comparable results to convolutional and transformer-based methods. However, most adopt spatial MLPs which take fixed dimension inputs, therefore making it difficult to apply them to downstream tasks, such as object detection and semantic segmentation. Moreover, single-stage designs further limit performance in other computer vision tasks and fully connected layers bear heavy computation. To tackle these problems, we propose ConvMLP: a hierarchical Convolutional MLP for visual recognition, which is a light-weight, stage-wise, co-design of convolution layers, and MLPs. In particular, ConvMLP-S achieves 76.8% top-1 accuracy on ImageNet-1k with 9M parameters and 2.4G MACs (15% and 19% of MLP-Mixer-B/16, respectively). Experiments on object detection and semantic segmentation further show that visual representation learned by ConvMLP can be seamlessly transferred and achieve competitive results with fewer parameters. Our code and pre-trained models are publicly available at https://github.com/SHI-Labs/Convolutional-MLPs.

  50. ⁠, Chuanxin Tang, Yucheng Zhao, Guangting Wang, Chong Luo, Wenxuan Xie, Wenjun Zeng (2021-09-12):

    Transformers have sprung up in the field of computer vision. In this work, we explore whether the core self-attention module in Transformer is the key to achieving excellent performance in image recognition. To this end, we build an attention-free network called sMLPNet based on the existing MLP-based vision models. Specifically, we replace the MLP module in the token-mixing step with a novel sparse MLP (sMLP) module. For 2D image tokens, sMLP applies 1D MLP along the axial directions and the parameters are shared among rows or columns. By sparse connection and weight sharing, the sMLP module significantly reduces the number of model parameters and FLOPs, avoiding the common over-fitting problem that plagues the performance of MLP-like models. When only trained on the ImageNet-1K dataset, the proposed sMLPNet achieves 81.9% top-1 accuracy with only 24M parameters, which is much better than most CNNs and vision Transformers under the same model size constraint. When scaling up to 66M parameters, sMLPNet achieves 83.4% top-1 accuracy, which is on par with the state-of-the-art Swin Transformer. The success of sMLPNet suggests that the self-attention mechanism is not necessarily a silver bullet in computer vision. Code will be made publicly available.
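
    A rough PyTorch sketch of the sparse token-mixing idea: one linear layer mixes tokens along the width axis (weights shared across rows), another along the height axis (shared across columns), and an identity path is kept; a 1×1 projection fuses the three. The branch layout and fusion step are illustrative assumptions:

    ```python
    import torch
    import torch.nn as nn

    class SparseMLP(nn.Module):
        """Minimal sMLP-style token mixer: axial 1D MLPs along W and along H, with
        weights shared across the other axis, plus an identity path, fused by 1x1."""
        def __init__(self, dim, height, width):
            super().__init__()
            self.mix_w = nn.Linear(width, width)
            self.mix_h = nn.Linear(height, height)
            self.fuse = nn.Conv2d(3 * dim, dim, kernel_size=1)

        def forward(self, x):                            # x: (B, C, H, W)
            w_mixed = self.mix_w(x)                      # linear over the last (W) axis
            h_mixed = self.mix_h(x.transpose(2, 3)).transpose(2, 3)  # linear over H
            return self.fuse(torch.cat([x, w_mixed, h_mixed], dim=1))

    x = torch.randn(2, 32, 14, 14)
    print(SparseMLP(32, 14, 14)(x).shape)                # torch.Size([2, 32, 14, 14])
    ```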

  51. Attention

  52. ⁠, Anonymous (2020-09-28):

    We propose an efficient Transformer that eliminates attention.

    We introduce Attention Free Transformer (AFT), an efficient variant of Transformers that eliminates the need for spatial attention. AFT offers great simplicity compared with standard Transformers, where the multi-head attention operation is replaced with the composition of element-wise multiplications/divisions and global/local pooling. We provide several variants of AFT along with simple yet efficient implementations that are supported by mainstream deep learning libraries. We show that, surprisingly, we are able to train AFT effectively on challenging benchmarks, and also to match or surpass the standard Transformer counterparts.

    [Keywords: Transformers, attention, efficient]
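
    A rough single-head PyTorch sketch of this kind of attention-free mixing, following the description above: exponentiated keys plus a learned pairwise position bias weight the values, and a sigmoid-gated query rescales the result element-wise. The layer sizes are illustrative, and a real implementation would reorder the computation for numerical stability and memory:

    ```python
    import torch
    import torch.nn as nn

    class AFTFull(nn.Module):
        """Sketch of an AFT-style layer: values are averaged with weights
        exp(K_s + w_{t,s}), then gated by sigmoid(Q_t); no per-head T x T
        attention matrix of dot products is computed."""
        def __init__(self, dim, max_len):
            super().__init__()
            self.to_q = nn.Linear(dim, dim)
            self.to_k = nn.Linear(dim, dim)
            self.to_v = nn.Linear(dim, dim)
            self.pos_bias = nn.Parameter(torch.zeros(max_len, max_len))  # w_{t,s}

        def forward(self, x):                                  # x: (B, T, D)
            T = x.shape[1]
            q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
            w = self.pos_bias[:T, :T]                          # (T, T)
            # weights over source positions s for every target position t
            weights = torch.exp(w)[None, :, :, None] * torch.exp(k)[:, None, :, :]  # (B,T,T,D)
            num = (weights * v[:, None, :, :]).sum(dim=2)      # (B, T, D)
            den = weights.sum(dim=2)
            return torch.sigmoid(q) * num / den

    x = torch.randn(2, 10, 16)
    print(AFTFull(16, max_len=64)(x).shape)                    # torch.Size([2, 10, 16])
    ```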

  53. ⁠, Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, Che Zheng (2020-05-02):

    The dot product self-attention is known to be central and indispensable to state-of-the-art Transformer models. But is it really required? This paper investigates the true importance and contribution of the dot product-based self-attention mechanism on the performance of Transformer models.

    Via extensive experiments, we find that (1) random alignment matrices surprisingly perform quite competitively and (2) learning attention weights from token-token (query-key) interactions is useful but not that important after all. To this end, we propose Synthesizer, a model that learns synthetic attention weights without token-token interactions.

    In our experiments, we first show that simple Synthesizers achieve highly competitive performance when compared against vanilla Transformer models across a range of tasks, including machine translation, language modeling, text generation and GLUE/SuperGLUE benchmarks. When composed with dot product attention, we find that Synthesizers consistently outperform Transformers. Moreover, we conduct additional comparisons of Synthesizers against Dynamic Convolutions, showing that the simple Random Synthesizer is not only 60% faster but also improves perplexity by a relative 3.5%. Finally, we show that simple factorized Synthesizers can outperform Linformer on encoding-only tasks.
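
    A minimal PyTorch sketch of a factorized Random Synthesizer head: the token-mixing weights are a learned low-rank matrix rather than query-key dot products, so only the values depend on the input. The rank and initialisation are illustrative assumptions:

    ```python
    import torch
    import torch.nn as nn

    class RandomSynthesizer(nn.Module):
        """Sketch of a Random Synthesizer head: the T x T mixing weights are a
        learned parameter (factorised as A @ B to save parameters), so no
        query-key interactions are computed."""
        def __init__(self, dim, max_len, rank=8):
            super().__init__()
            self.a = nn.Parameter(torch.randn(max_len, rank) * 0.02)
            self.b = nn.Parameter(torch.randn(rank, max_len) * 0.02)
            self.to_v = nn.Linear(dim, dim)

        def forward(self, x):                        # x: (B, T, D)
            T = x.shape[1]
            logits = (self.a @ self.b)[:T, :T]       # input-independent "attention"
            attn = torch.softmax(logits, dim=-1)
            return attn @ self.to_v(x)               # (T, T) broadcast over the batch

    x = torch.randn(2, 12, 32)
    print(RandomSynthesizer(32, max_len=128)(x).shape)   # torch.Size([2, 12, 32])
    ```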

  54. ⁠, Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, Hao Ma (2020-06-08):

    Large transformer models have shown extraordinary success in achieving state-of-the-art results in many natural language processing applications. However, training and deploying these models can be prohibitively costly for long sequences, as the standard self-attention mechanism of the Transformer uses 𝒪(n²) time and space with respect to sequence length. In this paper, we demonstrate that the self-attention mechanism can be approximated by a low-rank matrix. We further exploit this finding to propose a new self-attention mechanism, which reduces the overall self-attention complexity from 𝒪(n²) to 𝒪(n) in both time and space. The resulting linear transformer, the Linformer, performs on par with standard Transformer models, while being much more memory-efficient and time-efficient.
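
    A single-head PyTorch sketch of the low-rank idea: keys and values are linearly projected from sequence length n down to a fixed k, so the attention matrix is n × k instead of n × n. The head count, projection sharing, and choice of k are illustrative assumptions:

    ```python
    import torch
    import torch.nn as nn

    class LinformerAttention(nn.Module):
        """Sketch of Linformer-style attention (single head): keys and values are
        projected along the sequence axis to length k, giving O(n) time and memory
        in the sequence length for fixed k."""
        def __init__(self, dim, seq_len, k=64):
            super().__init__()
            self.to_q = nn.Linear(dim, dim)
            self.to_k = nn.Linear(dim, dim)
            self.to_v = nn.Linear(dim, dim)
            self.proj_k = nn.Linear(seq_len, k, bias=False)   # E
            self.proj_v = nn.Linear(seq_len, k, bias=False)   # F

        def forward(self, x):                                  # x: (B, n, D)
            q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
            k = self.proj_k(k.transpose(1, 2)).transpose(1, 2) # (B, k, D)
            v = self.proj_v(v.transpose(1, 2)).transpose(1, 2) # (B, k, D)
            attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)  # (B, n, k)
            return attn @ v                                    # (B, n, D)

    x = torch.randn(2, 512, 64)
    print(LinformerAttention(64, seq_len=512)(x).shape)        # torch.Size([2, 512, 64])
    ```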

  55. ⁠, Xuezhe Ma, Xiang Kong, Sinong Wang, Chunting Zhou, Jonathan May, Hao Ma, Luke Zettlemoyer (2021-06-03):

    The quadratic computational and memory complexities of the Transformer’s attention mechanism have limited its scalability for modeling long sequences. In this paper, we propose Luna, a linear unified nested attention mechanism that approximates softmax attention with two nested linear attention functions, yielding only linear (as opposed to quadratic) time and space complexity. Specifically, with the first attention function, Luna packs the input sequence into a sequence of fixed length. Then, the packed sequence is unpacked using the second attention function. As compared to a more traditional attention mechanism, Luna introduces an additional sequence with a fixed length as input and an additional corresponding output, which allows Luna to perform attention operation linearly, while also storing adequate contextual information. We perform extensive evaluations on three benchmarks of sequence modeling tasks: long-context sequence modeling, neural machine translation and masked language modeling for large-scale pretraining. Competitive or even better experimental results demonstrate both the effectiveness and efficiency of Luna compared to a variety of strong baseline methods.
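
    A minimal single-head PyTorch sketch of the pack-and-unpack pattern; treating the extra fixed-length sequence as a learned parameter and omitting the multi-head and normalisation details are simplifying assumptions:

    ```python
    import torch
    import torch.nn as nn

    def attention(q, k, v):
        """Plain single-head scaled dot-product attention."""
        w = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return w @ v

    class LunaAttention(nn.Module):
        """Sketch of nested attention: a fixed-length extra sequence p attends over
        the input ("pack", cost l*n), then the input attends over the packed
        sequence ("unpack", cost n*l), so cost is linear in n for fixed l."""
        def __init__(self, dim, num_extra=16):
            super().__init__()
            self.p = nn.Parameter(torch.randn(num_extra, dim) * 0.02)
            self.pack_proj = nn.Linear(dim, dim)
            self.unpack_proj = nn.Linear(dim, dim)

        def forward(self, x):                           # x: (B, n, D)
            p = self.p.expand(x.shape[0], -1, -1)       # (B, l, D)
            packed = attention(self.pack_proj(p), x, x) # (B, l, D)
            out = attention(self.unpack_proj(x), packed, packed)  # (B, n, D)
            return out, packed                          # packed is also passed onward

    x = torch.randn(2, 100, 32)
    out, packed = LunaAttention(32)(x)
    print(out.shape, packed.shape)                      # (2, 100, 32) (2, 16, 32)
    ```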

  56. ⁠, Meng-Hao Guo, Zheng-Ning Liu, Tai-Jiang Mu, Shi-Min Hu (2021-05-05):

    Attention mechanisms, especially self-attention, have played an increasingly important role in deep feature representation for visual tasks. Self-attention updates the feature at each position by computing a weighted sum of features using pair-wise affinities across all positions to capture the long-range dependency within a single sample. However, self-attention has quadratic complexity and ignores potential correlation between different samples. This paper proposes a novel attention mechanism which we call external attention, based on two external, small, learnable, shared memories, which can be implemented easily by simply using two cascaded linear layers and two normalization layers; it conveniently replaces self-attention in existing popular architectures. External attention has linear complexity and implicitly considers the correlations between all data samples. We further incorporate the multi-head mechanism into external attention to provide an all-MLP architecture, external attention MLP (EAMLP), for image classification. Extensive experiments on image classification, object detection, semantic segmentation, instance segmentation, image generation, and point cloud analysis reveal that our method provides results comparable or superior to the self-attention mechanism and some of its variants, with much lower computational and memory costs.
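
    A minimal PyTorch sketch of the two-linear-layer formulation: the input attends to two small learnable external memories instead of to itself, so the cost is linear in the number of tokens. The memory size and normalisation details are illustrative assumptions:

    ```python
    import torch
    import torch.nn as nn

    class ExternalAttention(nn.Module):
        """Sketch of external attention: two small shared memories, implemented as
        linear layers, replace the query-key-value interaction within the sample."""
        def __init__(self, dim, mem_size=64):
            super().__init__()
            self.mk = nn.Linear(dim, mem_size, bias=False)   # key memory
            self.mv = nn.Linear(mem_size, dim, bias=False)   # value memory

        def forward(self, x):                        # x: (B, N, D)
            attn = torch.softmax(self.mk(x), dim=1)  # normalise over the token axis
            # second (l1) normalisation over memory slots keeps the weights well-scaled
            attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-9)
            return self.mv(attn)                     # (B, N, D)

    x = torch.randn(2, 196, 64)
    print(ExternalAttention(64)(x).shape)            # torch.Size([2, 196, 64])
    ```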

  57. ⁠, Gustav Eje Henter, Simon Alexanderson, Jonas Beskow (2019-05-16):

    Data-driven modelling and synthesis of motion is an active research area with applications that include animation, games, and social robotics. This paper introduces a new class of probabilistic, generative, and controllable motion-data models based on normalising flows. Models of this kind can describe highly complex distributions, yet can be trained efficiently using exact maximum likelihood, unlike GANs or VAEs. Our proposed model is autoregressive and uses LSTMs to enable arbitrarily long time-dependencies. Importantly, it is also causal, meaning that each pose in the output sequence is generated without access to poses or control inputs from future time steps; this absence of algorithmic latency is important for interactive applications with real-time motion control. The approach can in principle be applied to any type of motion since it does not make restrictive, task-specific assumptions regarding the motion or the character morphology. We evaluate the models on motion-capture datasets of human and quadruped locomotion. Objective and subjective results show that randomly-sampled motion from the proposed method outperforms task-agnostic baselines and attains a motion quality close to recorded motion capture.

  58. ⁠, Tero Karras, Samuli Laine, Timo Aila (2018-12-12):

    We propose an alternative generator architecture for generative adversarial networks, borrowing from style transfer literature. The new architecture leads to an automatically learned, unsupervised separation of high-level attributes (e.g., pose and identity when trained on human faces) and stochastic variation in the generated images (e.g., freckles, hair), and it enables intuitive, scale-specific control of the synthesis. The new generator improves the state-of-the-art in terms of traditional distribution quality metrics, leads to demonstrably better interpolation properties, and also better disentangles the latent factors of variation. To quantify interpolation quality and disentanglement, we propose two new, automated methods that are applicable to any generator architecture. Finally, we introduce a new, highly varied and high-quality dataset of human faces.
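
    A minimal PyTorch sketch of the style-modulation mechanism at the heart of such a generator: adaptive instance normalisation driven by a mapped latent w. The two-layer mapping network and layer sizes here are illustrative stand-ins, not the full architecture:

    ```python
    import torch
    import torch.nn as nn

    class AdaIN(nn.Module):
        """Adaptive instance normalisation: the mapped latent w produces a per-channel
        scale and bias that re-statistics each feature map, so 'style' is injected at
        every resolution instead of only at the input layer."""
        def __init__(self, channels, w_dim):
            super().__init__()
            self.norm = nn.InstanceNorm2d(channels, affine=False)
            self.to_style = nn.Linear(w_dim, 2 * channels)

        def forward(self, x, w):                        # x: (B, C, H, W), w: (B, w_dim)
            scale, bias = self.to_style(w).chunk(2, dim=1)
            x = self.norm(x)
            return (1 + scale[:, :, None, None]) * x + bias[:, :, None, None]

    # z -> w mapping network (shortened to two layers for the sketch)
    mapping = nn.Sequential(nn.Linear(512, 512), nn.LeakyReLU(0.2), nn.Linear(512, 512))
    w = mapping(torch.randn(2, 512))
    print(AdaIN(64, 512)(torch.randn(2, 64, 16, 16), w).shape)   # torch.Size([2, 64, 16, 16])
    ```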

  59. 2018-karras-stylegan-figure1-styleganarchitecture.png

  60. ⁠, Ivan Anokhin, Kirill Demochkin, Taras Khakhulin, Gleb Sterkin, Victor Lempitsky, Denis Korzhenkov (2020-11-27):

    Existing image generator networks rely heavily on spatial convolutions and, optionally, self-attention blocks in order to gradually synthesize images in a coarse-to-fine manner. Here, we present a new architecture for image generators, where the color value at each pixel is computed independently given the value of a random vector and the coordinate of that pixel. No spatial convolutions or similar operations that propagate information across pixels are involved during the synthesis. We analyze the modeling capabilities of such generators when trained in an adversarial fashion, and observe the new generators to achieve similar generation quality to state-of-the-art convolutional generators. We also investigate several interesting properties unique to the new architecture.
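
    A toy PyTorch sketch of a per-pixel generator of this kind: each pixel's colour is an MLP of its Fourier-encoded coordinate and a shared latent vector, with no operation that propagates information between pixels. The encoding, widths, and latent handling are illustrative assumptions rather than the paper's architecture:

    ```python
    import torch
    import torch.nn as nn

    class PixelwiseGenerator(nn.Module):
        """Each pixel's RGB value depends only on (Fourier features of its coordinate,
        latent vector); no convolution or attention couples pixels."""
        def __init__(self, latent_dim=64, fourier_dim=32, hidden=128):
            super().__init__()
            self.freqs = nn.Parameter(torch.randn(2, fourier_dim) * 10, requires_grad=False)
            self.mlp = nn.Sequential(
                nn.Linear(2 * fourier_dim + latent_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 3),
            )

        def forward(self, coords, z):                     # coords: (N, 2) in [-1, 1], z: (latent_dim,)
            proj = coords @ self.freqs                    # (N, fourier_dim)
            feats = torch.cat([torch.sin(proj), torch.cos(proj),
                               z.expand(coords.shape[0], -1)], dim=-1)
            return self.mlp(feats)                        # (N, 3) RGB per pixel

    g = PixelwiseGenerator()
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, 32), torch.linspace(-1, 1, 32), indexing="ij")
    coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)
    print(g(coords, torch.randn(64)).shape)               # torch.Size([1024, 3])
    ```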

  61. ⁠, Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, Anima Anandkumar (2020-10-18):

    The classical development of neural networks has primarily focused on learning mappings between finite-dimensional Euclidean spaces. Recently, this has been generalized to neural operators that learn mappings between function spaces. For partial differential equations (PDEs), neural operators directly learn the mapping from any functional parametric dependence to the solution. Thus, they learn an entire family of PDEs, in contrast to classical methods which solve one instance of the equation. In this work, we formulate a new neural operator by parameterizing the integral kernel directly in Fourier space, allowing for an expressive and efficient architecture. We perform experiments on Burgers’ equation, Darcy flow, and Navier-Stokes equation. The Fourier neural operator is the first ML-based method to successfully model turbulent flows with zero-shot super-resolution. It is up to three orders of magnitude faster compared to traditional PDE solvers. Additionally, it achieves superior accuracy compared to previous learning-based solvers under fixed resolution.
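
    A 1D PyTorch sketch of the core Fourier layer: transform the input to frequency space, multiply a truncated set of low modes by learned complex weights, and transform back. Channel and mode counts are illustrative, and the full model interleaves such layers with pointwise linear maps and nonlinearities:

    ```python
    import torch
    import torch.nn as nn

    class SpectralConv1d(nn.Module):
        """Sketch of a Fourier layer: FFT along the spatial axis, multiply the lowest
        `modes` frequencies by learned complex weights, zero the rest, inverse FFT.
        Acting in frequency space lets the same weights apply at any resolution."""
        def __init__(self, channels, modes=16):
            super().__init__()
            self.modes = modes
            scale = 1.0 / channels
            self.w_real = nn.Parameter(scale * torch.randn(channels, channels, modes))
            self.w_imag = nn.Parameter(scale * torch.randn(channels, channels, modes))

        def forward(self, x):                               # x: (B, C, N) samples of a function
            x_ft = torch.fft.rfft(x, dim=-1)                # (B, C, N//2 + 1), complex
            weight = torch.complex(self.w_real, self.w_imag)
            out_ft = torch.zeros_like(x_ft)
            out_ft[:, :, :self.modes] = torch.einsum(
                "bim,iom->bom", x_ft[:, :, :self.modes], weight)
            return torch.fft.irfft(out_ft, n=x.shape[-1], dim=-1)

    x = torch.randn(4, 8, 256)                              # 8 channels on a 256-point grid
    print(SpectralConv1d(8)(x).shape)                       # torch.Size([4, 8, 256])
    ```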

  62. ⁠, Vincent Sitzmann, Julien N. P. Martel, Alexander W. Bergman, David B. Lindell, Gordon Wetzstein (2020-06-17):

    Implicitly defined, continuous, signal representations parameterized by neural networks have emerged as a powerful paradigm, offering many possible benefits over conventional representations. However, current network architectures for such implicit neural representations are incapable of modeling signals with fine detail, and fail to represent a signal’s spatial and temporal derivatives, despite the fact that these are essential to many physical signals defined implicitly as the solution to partial differential equations. We propose to leverage periodic activation functions for implicit neural representations and demonstrate that these networks, dubbed sinusoidal representation networks or Sirens, are ideally suited for representing complex natural signals and their derivatives. We analyze Siren activation statistics to propose a principled initialization scheme and demonstrate the representation of images, wavefields, video, sound, and their derivatives. Further, we show how Sirens can be leveraged to solve challenging boundary value problems, such as particular Eikonal equations (yielding signed distance functions), the Poisson equation, and the Helmholtz and wave equations. Lastly, we combine Sirens with hypernetworks to learn priors over the space of Siren functions.
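
    A small PyTorch sketch of a sinusoidal representation network: linear layers followed by sin(ω₀·), with layer-dependent uniform initialisation bounds of the kind the paper motivates; ω₀ = 30 and the widths here are illustrative choices:

    ```python
    import math
    import torch
    import torch.nn as nn

    class SineLayer(nn.Module):
        """One sine layer: y = sin(omega_0 * (Wx + b)), with weights initialised from
        U(-1/n, 1/n) for the first layer and U(-sqrt(6/n)/omega_0, sqrt(6/n)/omega_0)
        for later layers, to keep activations well-distributed through depth."""
        def __init__(self, in_dim, out_dim, omega_0=30.0, first=False):
            super().__init__()
            self.omega_0 = omega_0
            self.linear = nn.Linear(in_dim, out_dim)
            bound = 1.0 / in_dim if first else math.sqrt(6.0 / in_dim) / omega_0
            nn.init.uniform_(self.linear.weight, -bound, bound)

        def forward(self, x):
            return torch.sin(self.omega_0 * self.linear(x))

    # A tiny network mapping 2D coordinates to a greyscale value (an implicit image).
    siren = nn.Sequential(
        SineLayer(2, 64, first=True),
        SineLayer(64, 64),
        nn.Linear(64, 1),
    )
    coords = torch.rand(1024, 2) * 2 - 1        # coordinates in [-1, 1]^2
    print(siren(coords).shape)                  # torch.Size([1024, 1])
    ```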

  63. ⁠, Towaki Takikawa, Joey Litalien, Kangxue Yin, Karsten Kreis, Charles Loop, Derek Nowrouzezahrai, Alec Jacobson, Morgan McGuire, Sanja Fidler (2021-01-26):

    Neural signed distance functions (SDFs) are emerging as an effective representation for 3D shapes. State-of-the-art methods typically encode the SDF with a large, fixed-size neural network to approximate complex shapes with implicit surfaces. Rendering with these large networks is, however, computationally expensive since it requires many forward passes through the network for every pixel, making these representations impractical for real-time graphics. We introduce an efficient neural representation that, for the first time, enables real-time rendering of high-fidelity neural SDFs, while achieving state-of-the-art geometry reconstruction quality. We represent implicit surfaces using an octree-based feature volume which adaptively fits shapes with multiple discrete levels of detail (LODs), and enables continuous LOD with SDF interpolation. We further develop an efficient algorithm to directly render our novel neural SDF representation in real-time by querying only the necessary LODs with sparse octree traversal. We show that our representation is 2–3 orders of magnitude more efficient in terms of rendering speed compared to previous works. Furthermore, it produces state-of-the-art reconstruction quality for complex shapes under both 3D geometric and 2D image-space metrics.

  64. ⁠, Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, Ren Ng (2020-03-19):

    We present a method that achieves state-of-the-art results for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views. Our algorithm represents a scene using a fully-connected (non-convolutional) deep network, whose input is a single continuous 5D coordinate (spatial location (x,y,z) and viewing direction (θ, φ)) and whose output is the volume density and view-dependent emitted radiance at that spatial location. We synthesize views by querying 5D coordinates along camera rays and use classic volume rendering techniques to project the output colors and densities into an image. Because volume rendering is naturally differentiable, the only input required to optimize our representation is a set of images with known camera poses. We describe how to effectively optimize neural radiance fields to render photorealistic novel views of scenes with complicated geometry and appearance, and demonstrate results that outperform prior work on neural rendering and view synthesis. View synthesis results are best viewed as videos, so we urge readers to view our supplementary video for convincing comparisons.
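
    A small PyTorch sketch of the differentiable volume-rendering step the abstract refers to, which composites the network's per-sample colours and densities along each ray into one pixel colour (the coordinate MLP and ray sampling themselves are omitted; shapes are illustrative):

    ```python
    import torch

    def composite(rgb, sigma, deltas):
        """Classic volume-rendering quadrature:
            alpha_i = 1 - exp(-sigma_i * delta_i)
            T_i     = prod_{j<i} (1 - alpha_j)      (transmittance)
            C       = sum_i T_i * alpha_i * rgb_i
        rgb: (rays, samples, 3); sigma, deltas: (rays, samples)."""
        alpha = 1.0 - torch.exp(-sigma * deltas)
        trans = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)
        trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=-1)  # exclusive product
        weights = trans * alpha                              # (rays, samples)
        return (weights[..., None] * rgb).sum(dim=1)         # (rays, 3)

    rgb = torch.rand(8, 64, 3)          # colours predicted at 64 samples per ray
    sigma = torch.rand(8, 64)           # densities predicted at those samples
    deltas = torch.full((8, 64), 0.05)  # distances between adjacent samples
    print(composite(rgb, sigma, deltas).shape)               # torch.Size([8, 3])
    ```

    Because every step above is differentiable, gradients flow from a pixel-reconstruction loss back into the network that produced rgb and sigma, which is why posed images alone suffice as supervision.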

  65. ⁠, Christian Reiser, Songyou Peng, Yiyi Liao, Andreas Geiger (2021-03-25):

    NeRF synthesizes novel views of a scene with unprecedented quality by fitting a neural radiance field to RGB images. However, NeRF requires querying a deep Multi-Layer Perceptron (MLP) millions of times, leading to slow rendering times, even on modern GPUs. In this paper, we demonstrate that significant speed-ups are possible by utilizing thousands of tiny MLPs instead of one single large MLP. In our setting, each individual MLP only needs to represent parts of the scene, thus smaller and faster-to-evaluate MLPs can be used. By combining this divide-and-conquer strategy with further optimizations, rendering is accelerated by two orders of magnitude compared to the original NeRF model without incurring high storage costs. Further, using teacher-student distillation for training, we show that this speed-up can be achieved without sacrificing visual quality.

  66. ⁠, George Cazenavette, Manuel Ladron De Guevara (2021-05-28):

    While attention-based transformer networks achieve unparalleled success in nearly all language tasks, the large number of tokens coupled with the quadratic activation memory usage makes them prohibitive for visual tasks. As such, while language-to-language translation has been revolutionized by the transformer model, convolutional networks remain the de facto solution for image-to-image translation. The recently proposed MLP-Mixer architecture alleviates some of the speed and memory issues associated with attention-based networks while still retaining the long-range connections that make transformer models desirable. Leveraging this efficient alternative to self-attention, we propose a new unpaired image-to-image translation model called MixerGAN: a simpler MLP-based architecture that considers long-distance relationships between pixels without the need for expensive attention mechanisms. Quantitative and qualitative analysis shows that MixerGAN achieves competitive results when compared to prior convolutional-based methods.
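
    For reference, a standard MLP-Mixer block of the kind such a generator stacks, in PyTorch: a token-mixing MLP applied across patches provides the long-range connections mentioned above, and a channel-mixing MLP acts per patch. Hidden sizes here are illustrative:

    ```python
    import torch
    import torch.nn as nn

    class MixerBlock(nn.Module):
        """Token-mixing MLP (across patches) followed by channel-mixing MLP (per patch),
        each with a LayerNorm and residual connection."""
        def __init__(self, num_tokens, dim, token_hidden=256, channel_hidden=512):
            super().__init__()
            self.norm1 = nn.LayerNorm(dim)
            self.token_mlp = nn.Sequential(
                nn.Linear(num_tokens, token_hidden), nn.GELU(), nn.Linear(token_hidden, num_tokens))
            self.norm2 = nn.LayerNorm(dim)
            self.channel_mlp = nn.Sequential(
                nn.Linear(dim, channel_hidden), nn.GELU(), nn.Linear(channel_hidden, dim))

        def forward(self, x):                                  # x: (B, tokens, dim)
            x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
            return x + self.channel_mlp(self.norm2(x))

    x = torch.randn(2, 64, 128)                                # 64 patches, 128 channels
    print(MixerBlock(64, 128)(x).shape)                        # torch.Size([2, 64, 128])
    ```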