Efficient Attention: Attention with Linear Complexities

Efficient Attention: Attention with Linear Complexities is a work by my colleagues at SenseTime and me. We proposed a simple but effective method that reduces the computational and memory complexities of the attention mechanism from quadratic to linear, without loss of accuracy. This blog post introduces the method and the major results of the paper.

Motivation

The Attention Mechanism

The attention mechanism lets a neural network directly connect every pair of positions in its input. Its core advantage over recurrence and convolution is its ability to model long-range dependencies. The diagram below depicts a typical attention module.

[Figure: a typical attention module]

Drawback of Attention

Despite its excellent ability to model long-range dependencies, attention has a serious drawback. As the figure above shows, the intermediate result M has shape n*n, where n is the number of input positions. This means 1) its memory complexity is quadratic in n; 2) generating those quadratically many elements makes the computational complexity quadratic as well. Consequently, the memory and computational complexities of the entire attention module are quadratic.
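To make the cost concrete, here is a minimal PyTorch sketch of plain dot-product attention with normalization omitted; the tensor names and sizes are illustrative rather than taken from our released code.

```python
import torch

n, d_k, d_v = 4096, 64, 64    # n input positions; key and value dimensions
q = torch.randn(n, d_k)       # queries Q
k = torch.randn(n, d_k)       # keys K
v = torch.randn(n, d_v)       # values V

m = q @ k.t()                 # attention map M: n*n  -> quadratic memory
out = m @ v                   # output: n*d_v         -> quadratic computation
print(m.shape)                # torch.Size([4096, 4096])
```

Even at a modest n = 4096, M alone holds about 16.8 million entries, or 64 MB in 32-bit floats, and one such map is needed in every attention module of the network.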

Our Method

A Closer Look at the Architecture Diagram

To see how to remove the quadratic complexities, let’s take a closer look at the architecture diagram of the attention module. The most resource-consuming, and the only quadratically complex, part of the module is the attention map M of size n*n.

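The key observation is that we never need M itself, only the product M V. Because matrix multiplication is associative, we can compute K^T V first and multiply by Q afterwards, so nothing of size n*n is ever materialized. Here is a minimal sketch of the reordering, with normalization deferred to the next section and illustrative tensor names as before.

```python
import torch

n, d_k, d_v = 4096, 64, 64
q = torch.randn(n, d_k)       # queries Q
k = torch.randn(n, d_k)       # keys K
v = torch.randn(n, d_v)       # values V

context = k.t() @ v           # global context: d_k*d_v, independent of n
out = q @ context             # output: n*d_v, linear in n
```

Without normalization, the two orders of multiplication give mathematically identical outputs; only the cost changes, from quadratic in n to linear in n for both memory and computation.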

What about Normalization?

However, the analysis above only covers the vanilla version of attention. In practice, we usually normalize the attention maps to stabilize training. Will adding normalization break the analysis? Let’s examine this question for the two dominant approaches to attention normalization: scaling normalization and softmax normalization.

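The answer is that normalization need not break the analysis. Scaling normalization multiplies the attention map by a constant, and a constant factor can be split across Q and K, so the reordered module stays exactly equivalent. Softmax normalization over the full n*n map cannot be reordered directly, so efficient attention instead applies the softmax to Q (over the channel dimension) and to K (over the position dimension) separately before the multiplication; this is not algebraically identical, but it causes no loss of accuracy in our experiments. The sketch below summarizes both variants; it is a simplified single-head version with illustrative names, not our released code.

```python
import torch
import torch.nn.functional as F

def efficient_attention(q, k, v, norm="softmax"):
    # q, k: (n, d_k); v: (n, d_v); single head, no batch dimension.
    n = q.shape[0]
    if norm == "scaling":
        # Scaling normalization: dividing the n*n map by a constant (here n)
        # splits as 1/sqrt(n) on each of Q and K, so the reordering is exact.
        q = q / n ** 0.5
        k = k / n ** 0.5
    elif norm == "softmax":
        # Softmax normalization: softmax over channels for Q and over
        # positions for K, applied before the reordered multiplication.
        q = F.softmax(q, dim=1)
        k = F.softmax(k, dim=0)
    return q @ (k.t() @ v)    # output: n*d_v, still no n*n intermediate

out = efficient_attention(torch.randn(1024, 64),
                          torch.randn(1024, 64),
                          torch.randn(1024, 64))
```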

Empirical Validation

Object Detection and Instance Segmentation

We mainly conducted comparative and ablative studies on these two tasks, using the standard MS-COCO 2017 dataset.


Stereo Depth Estimation and Image Classification

To demonstrate the generalizability of efficient attention, we also evaluated it on two other important tasks: stereo depth estimation and image classification.


Why It Matters

Efficient attention dramatically reduces the resource needs of the attention mechanism. It offers two main advantages over the conventional formulation:

  1. Under the same resource budget, efficient attention delivers better performance.
  2. In fields where applying attention was previously infeasible because of its resource cost, efficient attention makes it possible.

Written by

A year-4 Computer Science student at The University of Hong Kong. Interested in artificial intelligence, computer vision, and natural language processing.
