Software-Defined Far Memory in Warehouse-Scale Computers

Authors:
Andres Lagar-Cavilla, Junwhan Ahn, Suleiman Souhlal, Neha Agarwal, Radoslaw Burny, Shakeel Butt, Jichuan Chang, Ashwin Chaugule, Nan Deng, + 5

Affiliation (all authors listed above): Google, Mountain View, CA, USA

Publication: ASPLOS '19: Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, April 2019, Pages 317–330. https://doi.org/10.1145/3297858.3304053

ABSTRACT

Increasing memory demand and slowdown in technology scaling pose important challenges to total cost of ownership (TCO) of warehouse-scale computers (WSCs). One promising idea to reduce the memory TCO is to add a cheaper, but slower, "far memory" tier and use it to store infrequently accessed (or cold) data. However, introducing a far memory tier brings new challenges around dynamically responding to workload diversity and churn, minimizing stranding of capacity, and addressing brownfield (legacy) deployments. We present a novel software-defined approach to far memory that proactively compresses cold memory pages to effectively create a far memory tier in software. Our end-to-end system design encompasses new methods to define performance service-level objectives (SLOs), a mechanism to identify cold memory pages while meeting the SLO, and our implementation in the OS kernel and node agent. Additionally, we design learning-based autotuning to periodically adapt our design to fleet-wide changes without a human in the loop. Our system has been successfully deployed across Google's WSC since 2016, serving thousands of production services. Our software-defined far memory is significantly cheaper (67% or higher memory cost reduction) at relatively good access speeds (6 µs) and allows us to store a significant fraction of infrequently accessed data (on average, 20%), translating to significant TCO savings at warehouse scale.
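
The core idea in the abstract, proactively compressing pages that have not been accessed recently, can be illustrated with a toy user-space model. This is only a sketch under stated assumptions and not the authors' kernel implementation: the `Page` class, the `cold_age_threshold` parameter, the 4096-byte page size, and the use of `zlib` as the compressor are all hypothetical choices made for illustration.

```python
import zlib
from dataclasses import dataclass

PAGE_SIZE = 4096  # bytes; an assumed page size for this toy model


@dataclass
class Page:
    """Hypothetical stand-in for a memory page (not a kernel structure)."""
    data: bytes
    last_access: float  # seconds, on the same clock as `now` below
    compressed: bool = False


def compress_cold_pages(pages, now, cold_age_threshold):
    """Compress every uncompressed page idle for at least
    `cold_age_threshold` seconds. Returns the number of bytes saved."""
    saved = 0
    for page in pages:
        if page.compressed:
            continue
        if now - page.last_access >= cold_age_threshold:
            packed = zlib.compress(page.data)
            # Keep the page uncompressed if compression does not help.
            if len(packed) < len(page.data):
                saved += len(page.data) - len(packed)
                page.data = packed
                page.compressed = True
    return saved


def access(page, now):
    """Touch a page, decompressing it first if it sits in the
    compressed ("far") tier, and return its contents."""
    if page.compressed:
        page.data = zlib.decompress(page.data)
        page.compressed = False
    page.last_access = now
    return page.data


# Usage: one recently used page and one idle page.
hot = Page(data=bytes(PAGE_SIZE), last_access=100.0)
cold = Page(data=bytes(PAGE_SIZE), last_access=0.0)
saved_bytes = compress_cold_pages([hot, cold], now=120.0, cold_age_threshold=60.0)
```

In this model the threshold is the single tuning knob: a lower value compresses more memory but makes it more likely that a compressed page is accessed again and must pay the decompression cost, which is the kind of trade-off the abstract's SLO and autotuning components are described as managing.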

Index Terms

Software-Defined Far Memory in Warehouse-Scale Computers
1. Computer systems organization
  1. Architectures
    1. Distributed architectures
2. Software and its engineering
  1. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        Memory management
