Performance-Aware Speculative Resource Oversubscription for Large-Scale Clusters

Lookup NU author(s): Dr Zhenyu Wen

Downloads

Full text for this publication is not currently held within this repository. Alternative links are provided below where available.


Abstract

© 1990-2012 IEEE. It is a long-standing challenge to achieve a high degree of resource utilization in cluster scheduling. Resource oversubscription has become a common practice for improving resource utilization and reducing cost. However, current centralized approaches to oversubscription suffer from resource mismatch and fail to take into account other performance requirements, e.g., tail latency. In this article we present ROSE, a new resource management platform capable of conducting performance-aware resource oversubscription. ROSE allows latency-sensitive long-running applications (LRAs) to co-exist with computation-intensive batch jobs. Instead of waiting for resource allocation to be confirmed by the centralized scheduler, job managers in ROSE can independently request to launch speculative tasks on specific machines according to their suitability for oversubscription. Node agents of those machines can, however, avoid any excessive resource oversubscription by means of an admission control mechanism that uses multi-resource threshold control and a performance-aware resource throttle. Experiments show that under mixed co-location of batch jobs and latency-sensitive LRAs, the CPU utilization and the disk utilization can reach 56.34 and 43.49 percent, respectively, while the 95th percentile of read latency in YCSB workloads increases by only 5.4 percent compared with executing the LRAs alone.
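
The node-side admission control described in the abstract can be illustrated with a minimal sketch. This is not ROSE's implementation: the `Resources` type, the threshold values, the `admit` and `throttle_factor` functions, and the latency tolerance are all hypothetical names and numbers chosen for illustration. The sketch only shows the general idea of rejecting a speculative task when any resource would exceed its threshold, and of scaling back speculative work when observed tail latency drifts above a baseline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    """Resource usage as fractions of node capacity (0.0 to 1.0). Hypothetical type."""
    cpu: float
    mem: float
    disk: float


# Assumed per-resource ceilings; the paper's actual thresholds are not given in the abstract.
THRESHOLDS = Resources(cpu=0.8, mem=0.8, disk=0.7)


def admit(current: Resources, request: Resources,
          thresholds: Resources = THRESHOLDS) -> bool:
    """Multi-resource threshold check: admit a speculative task only if
    every resource stays at or below its threshold after adding the request."""
    return (current.cpu + request.cpu <= thresholds.cpu
            and current.mem + request.mem <= thresholds.mem
            and current.disk + request.disk <= thresholds.disk)


def throttle_factor(observed_p95_ms: float, baseline_p95_ms: float,
                    tolerance: float = 0.25) -> float:
    """Performance-aware throttle (illustrative): return a multiplier in [0, 1]
    for the share given to speculative tasks. 1.0 means no throttling; the
    factor shrinks as observed tail latency exceeds baseline * (1 + tolerance)."""
    limit = baseline_p95_ms * (1.0 + tolerance)
    if observed_p95_ms <= limit:
        return 1.0
    return max(0.0, limit / observed_p95_ms)


if __name__ == "__main__":
    node = Resources(cpu=0.5, mem=0.4, disk=0.3)
    task = Resources(cpu=0.2, mem=0.1, disk=0.1)
    print("admit:", admit(node, task))
    print("throttle:", throttle_factor(observed_p95_ms=25.0, baseline_p95_ms=10.0))
```

In this sketch a single resource over its ceiling is enough to reject the request, which is one simple way to avoid oversubscribing along any one dimension; a real system would likely also account for measurement noise and reclaim resources from already running speculative tasks.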


Publication metadata

Author(s): Yang R, Hu C, Sun X, Garraghan P, Wo T, Wen Z, Peng H, Xu J, Li C

Publication type: Article

Publication status: Published

Journal: IEEE Transactions on Parallel and Distributed Systems

Year: 2020

Volume: 31

Issue: 7

Pages: 1499-1517

Online publication date: 28/01/2020

Acceptance date: 25/01/2020

ISSN (print): 1045-9219

ISSN (electronic): 1558-2183

Publisher: IEEE Computer Society

URL: https://doi.org/10.1109/TPDS.2020.2970013

DOI: 10.1109/TPDS.2020.2970013

