Doctoral thesis

Reinforcement learning with general evaluators and generators of policies

  • 2024

PhD: Università della Svizzera italiana

Reinforcement Learning (RL) is a subfield of Artificial Intelligence that studies how machines can make decisions by learning from their interactions with an environment. The key aspect of RL is evaluating and improving policies, which dictate the behavior of artificial agents by mapping sensory input to actions. Typically, RL algorithms evaluate these policies using a value function, which is generally specific to one policy. However, when value functions are updated to track the learned policy, they can forget potentially useful information about previous policies. To address the problem of generalization across many policies, we introduce Parameter-Based Value Functions (PBVFs), a class of value functions that take policy parameters as inputs. A PBVF is a single model capable of evaluating the performance of any policy, given a state, a state-action pair, or a distribution over the RL agent's initial states, and it can generalize across different policies. We derive off-policy actor-critic algorithms based on PBVFs. To input the policy into the value function, we employ a technique called policy fingerprinting. This method compresses the policy parameters, rendering PBVFs invariant to changes in the policy architecture. This policy embedding extracts crucial abstract knowledge about the environment, distilled into a limited number of states sufficient to define the behavior of various policies. A policy can improve solely by modifying actions in such states, following the gradient of the value function's predictions. Extensive experiments show that our method outperforms evolutionary algorithms, providing a more efficient direct search in the policy space. Furthermore, it achieves performance comparable to that of competitive continuous control algorithms. We also apply this technique to learn useful representations of Recurrent Neural Network weight matrices, showing its effectiveness in several supervised learning tasks.
Lastly, we empirically demonstrate how this approach can be integrated with HyperNetworks to train a single goal-conditioned neural network (NN) capable of generating deep NN policies that achieve any desired return observed during training. The majority of this thesis is based on previous papers published by the author.
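
The fingerprinting idea described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes a linear policy, a linear value head, and fixed random probing states, whereas the abstract describes probing states and value function that are learned. All names (`linear_policy`, `fingerprint`, `predicted_return`, `value_gradient_wrt_policy`) are hypothetical. The sketch shows two things: the fingerprint has a size that depends only on the number of probing states and the action dimension, and a policy can be improved by following the gradient of the predicted return.

```python
import random

def linear_policy(weights, state):
    """Deterministic linear policy: action_i = sum_j weights[i][j] * state[j]."""
    return [sum(w * s for w, s in zip(row, state)) for row in weights]

def fingerprint(policy_fn, probing_states):
    """Concatenate the actions the policy takes in each probing state."""
    actions = []
    for state in probing_states:
        actions.extend(policy_fn(state))
    return actions

def predicted_return(head_weights, head_bias, fp):
    """Toy linear value head over the fingerprint (an assumption of this sketch)."""
    return sum(w * f for w, f in zip(head_weights, fp)) + head_bias

def value_gradient_wrt_policy(head_weights, probing_states, action_dim, state_dim):
    """Analytic gradient of the predicted return w.r.t. the linear policy weights."""
    grad = [[0.0] * state_dim for _ in range(action_dim)]
    for k, state in enumerate(probing_states):
        for i in range(action_dim):
            for j in range(state_dim):
                grad[i][j] += head_weights[k * action_dim + i] * state[j]
    return grad

# Illustrative sizes and random values; in practice these would be learned from data.
rng = random.Random(0)
STATE_DIM, ACTION_DIM, NUM_PROBES = 3, 2, 4
probing_states = [[rng.uniform(-1, 1) for _ in range(STATE_DIM)]
                  for _ in range(NUM_PROBES)]
head_weights = [rng.uniform(-1, 1) for _ in range(NUM_PROBES * ACTION_DIM)]
head_bias = 0.0
policy_weights = [[rng.uniform(-1, 1) for _ in range(STATE_DIM)]
                  for _ in range(ACTION_DIM)]

def evaluate(weights):
    """Predicted return of the linear policy defined by `weights`."""
    fp = fingerprint(lambda s: linear_policy(weights, s), probing_states)
    return predicted_return(head_weights, head_bias, fp)

value_before = evaluate(policy_weights)

# One gradient-ascent step on the policy parameters.
STEP = 0.1
grad = value_gradient_wrt_policy(head_weights, probing_states, ACTION_DIM, STATE_DIM)
improved_weights = [[w + STEP * g for w, g in zip(w_row, g_row)]
                    for w_row, g_row in zip(policy_weights, grad)]
value_after = evaluate(improved_weights)

print(f"predicted return before: {value_before:.4f}, after: {value_after:.4f}")
```

Because the value head only sees the policy's actions in the probing states, any policy with the same state and action dimensions could be passed as `policy_fn`, regardless of how many parameters it has. In this toy linear setting the gradient step is guaranteed not to decrease the predicted return; with a learned, nonlinear value function that guarantee would not hold in general.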
  • English
Computer science and technology
