Compared to 2012, it now takes 44× less compute to train a neural network to AlexNet-level performance (by contrast, Moore’s Law3 would have yielded only an 11× cost improvement over this period). Our results suggest that, for AI tasks with high levels of recent investment, algorithmic progress has yielded more gains than classical hardware efficiency.
…For our analysis, we primarily leveraged open-source re-implementations19,20,21 to measure progress toward AlexNet-level performance over a long horizon. We saw a similar rate of training-efficiency improvement for ResNet-50-level performance on ImageNet (17-month doubling time).7,16 We saw faster rates of improvement over shorter timescales in translation, Go, and Dota 2:
- Within translation, the Transformer22 surpassed seq2seq23 performance on English-to-French translation on WMT’14 with 61× less training compute 3 years later.
- We estimate AlphaZero24 took 8× less compute to get to AlphaGo Zero25 level performance 1 year later.
- OpenAI Five Rerun required 5× less training compute to surpass OpenAI Five26 (which beat the world champions, OG) 3 months later.
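The doubling times quoted in this section follow from a simple relationship between a total efficiency factor and the span over which it was observed. A minimal sketch (the function name and the rounded time spans are our own illustrative choices, not from the original analysis):

```python
import math

def doubling_time_months(efficiency_factor: float, years: float) -> float:
    """Doubling time implied by a total efficiency gain over a time span,
    assuming steady exponential improvement."""
    return years * 12 / math.log2(efficiency_factor)

# Figures from the text (spans are approximate, so results are rough):
print(doubling_time_months(44, 7))    # AlexNet-level, 2012 -> 2019: ~15 months
print(doubling_time_months(61, 3))    # Transformer vs. seq2seq: ~6 months
print(doubling_time_months(8, 1))     # AlphaZero vs. AlphaGo Zero: 4 months
print(doubling_time_months(5, 0.25))  # OpenAI Five Rerun: ~1.3 months
```

Note how the shorter-horizon results (translation, Go, Dota 2) imply much faster doubling than the long-horizon AlexNet trend.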
It can be helpful to think of compute in 2012 as not being equal to compute in 2019, much as dollars need to be inflation-adjusted over time: a fixed amount of compute could accomplish more in 2019 than in 2012. One way to think about this is that some types of AI research progress in two stages, similar to the “tick-tock” model of development seen in semiconductors: new capabilities (the “tick”) typically require a substantial compute expenditure to obtain, and refined versions of those capabilities (the “tock”) then become much more efficient to deploy thanks to process improvements. Increases in algorithmic efficiency allow researchers to run more experiments of interest in a given amount of time and money. In addition to being a measure of overall progress, algorithmic efficiency gains speed up future AI research in a way that’s somewhat analogous to having more compute.
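Under this inflation analogy, a steady doubling time converts directly into an “exchange rate” between compute budgets in different years. A rough sketch (the ~15-month doubling time is an illustrative assumption close to the rate implied by the 44× AlexNet result, not an exact figure from the analysis):

```python
def effective_compute_multiplier(months_elapsed: float,
                                 doubling_time_months: float) -> float:
    """How much further a fixed compute budget goes after a period,
    if algorithmic efficiency doubles every `doubling_time_months`."""
    return 2 ** (months_elapsed / doubling_time_months)

# Illustrative: with a ~15-month doubling time, a fixed 2012 compute
# budget stretches dozens of times further by 2019 (84 months later):
print(effective_compute_multiplier(84, 15))  # roughly 48x
```

This is the sense in which a 2012 petaflop/s-day and a 2019 petaflop/s-day buy very different amounts of progress.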
…We also find increases in inference efficiency, measured in GPU time32, parameters16, and FLOPs, meaningful, but mostly as a result of their economic implications [inference costs dominate total costs for successful deployed systems; inference costs scale with usage of the system, whereas training costs only need to be paid once] rather than their effect on future research progress. ShuffleNet13 achieved AlexNet-level performance with an 18× inference efficiency increase in 5 years (15-month doubling time), which suggests that training efficiency and inference efficiency might improve at similar rates.
…For all these reasons, we’re going to start tracking efficiency SOTAs publicly. We’ll start with vision and translation efficiency benchmarks (ImageNet and WMT’14), and we’ll consider adding more benchmarks over time. We believe there are efficiency SOTAs on these benchmarks we’re unaware of and encourage the research community to submit them here (we’ll give credit to original authors and collaborators).