TY - JOUR
T1 - Scalable Virtual Machine Migration using Reinforcement Learning
AU - Hummaida, Abdul Rahman
AU - Paton, Norman W.
AU - Sakellariou, Rizos
PY - 2022/4/28
Y1 - 2022/4/28
N2 - Heuristic approaches require fixed knowledge of how resource allocation should be carried out, and this can be limiting when managing variable cloud workloads. Solutions based on Reinforcement Learning (RL) have been presented to manage cloud infrastructure, however, these tend to be centralized and suffer in their ability to maintain Quality of Service (QoS) for data centres with thousands of nodes. To address this, we propose a reinforcement learning management policy, which can run decentralized, and achieve fast convergence towards efficient resource allocation, resulting in lower SLA violations compared to centralized architectures. To address some of the common challenges in applying RL to cloud resource management, such as slow learning and state/action management, we use parallel learning and reduction of the state/action space. We apply a decision making approach to optimize the migration of a VM and choose a target node to host the VM in such a way that brings response time within SLA level. We have also demonstrate unique, multi-level reinforcement learning cooperation, that further reduces SLA violations. We use simulation to evaluate and demonstrate our proposal in practice, and compare the results obtained with an established heuristic, demonstrating significant improvement to SLA violations and higher scalability.
AB - Heuristic approaches require fixed knowledge of how resource allocation should be carried out, and this can be limiting when managing variable cloud workloads. Solutions based on Reinforcement Learning (RL) have been presented to manage cloud infrastructure, however, these tend to be centralized and suffer in their ability to maintain Quality of Service (QoS) for data centres with thousands of nodes. To address this, we propose a reinforcement learning management policy, which can run decentralized, and achieve fast convergence towards efficient resource allocation, resulting in lower SLA violations compared to centralized architectures. To address some of the common challenges in applying RL to cloud resource management, such as slow learning and state/action management, we use parallel learning and reduction of the state/action space. We apply a decision making approach to optimize the migration of a VM and choose a target node to host the VM in such a way that brings response time within SLA level. We have also demonstrate unique, multi-level reinforcement learning cooperation, that further reduces SLA violations. We use simulation to evaluate and demonstrate our proposal in practice, and compare the results obtained with an established heuristic, demonstrating significant improvement to SLA violations and higher scalability.
U2 - 10.1007/s10723-022-09603-4
DO - 10.1007/s10723-022-09603-4
M3 - Article
SN - 1570-7873
VL - 20
JO - Journal of Grid Computing
JF - Journal of Grid Computing
IS - 2
ER -