Compared to 2012, it now takes 44× less compute to train a neural network to the level of AlexNet (by contrast, Moore’s Law [3] would yield an 11× cost
improvement over this period). Our results suggest that for AI tasks with high levels of recent investment, algorithmic progress has yielded more gains than
classical hardware efficiency.
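Figures like the 44× gain can be converted into a doubling time with simple exponential arithmetic. The sketch below assumes a constant exponential rate of improvement over the 2012–2019 span; the helper name is ours, and the inputs are the 44× and roughly 7-year figures from the text above.

```python
import math

def doubling_time_months(efficiency_multiple: float, years: float) -> float:
    """Doubling time implied by an overall efficiency gain,
    assuming a constant exponential rate of improvement."""
    return 12 * years / math.log2(efficiency_multiple)

# 44x less training compute for AlexNet-level performance, 2012 -> 2019
print(round(doubling_time_months(44, 7), 1))  # roughly 15-16 months
```

The same formula recovers the other doubling times quoted in this piece from their efficiency multiples and time spans.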
…For our analysis, we primarily leveraged open-source re-implementations [19, 20, 21] to measure progress on AlexNet-level performance over a long
horizon. We saw a similar rate of training efficiency improvement for ResNet-50-level performance on ImageNet (17-month doubling time) [7, 16]. We saw faster rates of improvement
over shorter timescales in translation, Go, and Dota 2:
- Within translation, the Transformer [22] surpassed seq2seq [23] performance on English to French translation on WMT’14 with 61× less training compute 3 years later.
- We estimate AlphaZero [24] took 8× less compute to get to AlphaGo Zero [25] level performance 1 year later.
- OpenAI Five Rerun required 5× less training compute to surpass OpenAI Five [26] (which beat the world champions, OG) 3 months later.
It can be helpful to think of compute in 2012 as not being equivalent to compute in 2019, much as dollars need to be inflation-adjusted over time. A fixed
amount of compute could accomplish more in 2019 than in 2012. One way to think about this is that some types of AI research progress in two stages, similar to the
“tick tock” model of development seen in semiconductors; new capabilities (the “tick”) typically require a substantial amount of compute expenditure to obtain,
then refined versions of those capabilities (the “tock”) become much more efficient to deploy due to process improvements. Increases in algorithmic efficiency
allow researchers to do more experiments of interest in a given amount of time and money. In addition to being a measure of overall progress, algorithmic
efficiency gains speed up future AI research in a way that’s somewhat analogous to having more compute.
…We also find increases in inference efficiency in terms of GPU time [32], parameters [16], and
FLOPs meaningful, but mostly as a result of their economic implications [Inference costs dominate total costs for successful deployed systems. Inference costs
scale with usage of the system, whereas training costs only need to be paid once.] rather than their effect on future research progress. ShuffleNet [13]
achieved AlexNet-level performance with an 18× inference efficiency increase in 5 years (15-month doubling time), which suggests that training efficiency and
inference efficiency might improve at similar rates.
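The 15-month figure can be checked with the same exponential arithmetic used for training efficiency (a sketch assuming a constant rate of improvement; the 18× and 5-year inputs come from the text above):

```python
import math

# 18x inference efficiency gain over 5 years, assuming a constant
# exponential rate of improvement
doubling_time_months = 12 * 5 / math.log2(18)
print(round(doubling_time_months, 1))  # ~14.4, i.e. about 15 months
```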
…For all these reasons, we’re going to start tracking efficiency SOTAs publicly. We’ll start with vision and
translation efficiency benchmarks (ImageNet and WMT’14), and we’ll consider
adding more benchmarks over time. We believe there are efficiency SOTAs on these benchmarks we’re unaware
of and encourage the research community to submit them here (we’ll give credit to original authors and collaborators).