Goal-conditioned reinforcement learning (GCRL) has emerged as a promising avenue in machine learning, offering the potential to master multiple tasks tailored to specific goals and to solve diverse real-world problems. However, the difficulty of designing dense rewards for real-world scenarios often leads to sparse-reward settings, and the intricacy of these problems frequently necessitates solutions that span long time horizons. Together, these factors hinder effective exploration and cause slow credit assignment, leading to low learning efficiency in GCRL. To improve learning efficiency in the face of these challenges, this thesis encompasses three primary studies.

The first study enhances exploration through pre-trained skills acquired from environments whose goal-transition patterns resemble those of downstream tasks. These skills are optimised to maximise the local entropy of attained goals during execution, and are then integrated into the exploration phase of downstream tasks by sampling them from a uniform distribution. While this approach improves exploration efficiency, it does not account for discrepancies between the pre-training environments and downstream tasks.

To address this limitation, the second study introduces an adaptive skill distribution that accounts for environmental structural patterns, which may affect the effectiveness of skills. These patterns are defined over historical contexts. The adaptive skill distribution learns to identify them by optimising the local entropy of achieved goals within the given context, and it remains consistent across historical contexts with similar structural patterns, enabling the agent to apply appropriate skills in familiar scenarios. This facilitates efficient and thorough exploration, even in previously unvisited states, and supports significant exploration progress in new environments that share known structural patterns, without requiring additional learning.

Building on these improvements in exploration, the third study shifts focus to accelerating credit assignment by replacing traditional single-step learning with multi-step learning methods. A key challenge in multi-step off-policy learning is managing off-policy bias. Through detailed analysis, we clearly differentiate and categorise off-policy bias into two types: one common to both traditional RL and GCRL, and another unique to GCRL due to its goal-maintenance feature. While we aim to mitigate the detrimental effects of both biases, we also recognise that some off-policy bias can be beneficial. We therefore introduce an integrated method based on quantile regression and truncation, which leverages beneficial off-policy bias while mitigating the harmful kind. This approach accelerates credit assignment and yields stable, resilient learning.

In summary, this research enhances learning efficiency in GCRL, particularly in scenarios with sparse rewards and long time horizons. By addressing the challenges of exploration and credit assignment, these studies contribute to the development of more robust and adaptable GCRL algorithms, enabling agents to perform better across diverse and complex tasks.
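To make the first study's objective more concrete, the sketch below shows one common way a "local entropy of attained goals" signal could be computed during skill pre-training: a particle-based (k-nearest-neighbour) entropy estimate over recently achieved goals. This is a minimal illustration under assumptions of my own; the function name, window size, and estimator choice are not taken from the thesis.

```python
import numpy as np

def local_goal_entropy_reward(achieved_goals, window=64, k=5):
    """Intrinsic reward for the most recent step: mean log-distance from the latest
    achieved goal to its k nearest neighbours among goals achieved in the recent
    window. Up to constants, this tracks a k-NN estimate of differential entropy,
    so it is larger when recently attained goals are more spread out."""
    recent = np.asarray(achieved_goals[-window:], dtype=float)  # (m, goal_dim)
    if len(recent) <= k:
        return 0.0                                              # not enough particles yet
    query = recent[-1]                                          # latest achieved goal
    dists = np.linalg.norm(recent[:-1] - query, axis=1)         # distances to earlier goals
    knn = np.sort(dists)[:k]                                    # k nearest neighbours
    return float(np.log(knn + 1e-6).mean())
```

A skill policy conditioned on a latent skill variable could then be trained with any standard RL algorithm to maximise such an intrinsic reward; in the second study the analogous entropy signal is measured within a given historical context rather than over the rollout as a whole.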
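For the second study's adaptive skill distribution, the following sketch shows one plausible form: a small network mapping an embedding of the recent history (the "context") to a categorical distribution over pre-trained skills, so that contexts with similar structural patterns yield similar skill distributions. The class name, architecture, and the choice of a categorical distribution are illustrative assumptions, not the thesis's design.

```python
import torch
import torch.nn as nn

class AdaptiveSkillDistribution(nn.Module):
    """Maps a context embedding (a summary of recent transitions / achieved goals)
    to a distribution over a fixed set of pre-trained skills."""

    def __init__(self, context_dim, n_skills, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(context_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_skills),
        )

    def forward(self, context):
        # context: (batch, context_dim)
        return torch.distributions.Categorical(logits=self.net(context))

# During exploration, a skill would be sampled per context and executed for a
# fixed number of steps, e.g.:
#   dist = skill_dist(context)        # skill_dist: a trained AdaptiveSkillDistribution
#   skill_index = dist.sample()
```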
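For the third study, the sketch below illustrates the kind of machinery involved when combining multi-step targets with a quantile (distributional) critic and truncation: an n-step return bootstrapped from target-critic quantiles, with the largest target quantiles dropped as one possible way to curb harmful over-optimistic off-policy bias. The function names, the fixed 80% keep ratio, and the quantile Huber loss (written in the standard QR-DQN form) are assumptions for illustration, not the thesis's exact method.

```python
import torch

def truncated_nstep_quantile_target(rewards, bootstrap_quantiles, gamma=0.99, keep_ratio=0.8):
    """n-step target per quantile: sum_{k<n} gamma^k r_{t+k} + gamma^n * Z(s_{t+n}, g, a'),
    keeping only the smallest `keep_ratio` fraction of bootstrap quantiles (truncation)."""
    n = rewards.shape[1]                                                 # rewards: (batch, n)
    discounts = gamma ** torch.arange(n, dtype=rewards.dtype)            # (n,)
    nstep_reward = (rewards * discounts).sum(dim=1, keepdim=True)        # (batch, 1)
    keep = max(1, int(keep_ratio * bootstrap_quantiles.shape[1]))
    truncated = bootstrap_quantiles.sort(dim=1).values[:, :keep]         # drop largest quantiles
    return nstep_reward + (gamma ** n) * truncated                       # (batch, keep)


def quantile_huber_loss(pred_quantiles, target_quantiles, kappa=1.0):
    """Quantile-regression Huber loss between predicted quantiles (batch, N)
    and (possibly truncated) target quantiles (batch, M)."""
    n_quantiles = pred_quantiles.shape[1]
    taus = (torch.arange(n_quantiles, dtype=pred_quantiles.dtype) + 0.5) / n_quantiles
    td = target_quantiles.unsqueeze(1) - pred_quantiles.unsqueeze(2)     # (batch, N, M)
    huber = torch.where(td.abs() <= kappa, 0.5 * td ** 2, kappa * (td.abs() - 0.5 * kappa))
    weight = (taus.view(1, -1, 1) - (td.detach() < 0).float()).abs()
    return (weight * huber / kappa).mean()
```

In a goal-conditioned setting, the predicted quantiles would come from a critic conditioned on both the state-action pair and the goal, and the bootstrap quantiles from a target network evaluated n steps ahead; hindsight goal relabelling is one source of the GCRL-specific off-policy bias the abstract distinguishes from the ordinary kind.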
Date of Award: 6 Jan 2025
Original language: English
Awarding Institution: The University of Manchester
Supervisors: Xiaojun Zeng & Ke Chen
Keywords:
- Skills
- Multi-Step Learning
- Sparse Rewards
- Learning Efficiency
- Goal-Conditioned Reinforcement Learning
- Exploration Efficiency
Boosting Learning Efficiency in Goal-Conditioned Reinforcement Learning: Skill Augmentation and Multi-step Learning
Wu, L. (Author). 6 Jan 2025
Student thesis: PhD