Reinforcement learning for value alignment
Manel Rodríguez-Soto (26/Jun/2023)

An industrial PhD

Advisors: Maite López-Sánchez; Juan A. Rodríguez-Aguilar

University: 

Abstract: 

As autonomous agents become increasingly sophisticated and we allow them to perform more complex tasks, it is of utmost importance to guarantee that they will act in alignment with human values. In the AI literature, this is known as the value alignment problem. Current approaches apply reinforcement learning to align agents with values because of its recent successes at solving complex sequential decision-making problems. However, these approaches are agent-centric: they expect the agent to apply the reinforcement learning algorithm correctly to learn an ethical behaviour, without formal guarantees that the learnt behaviour will in fact be ethical. This thesis proposes a novel environment-designer approach for solving the value alignment problem with theoretical guarantees.
Our environment-designer approach advances the state of the art with a process for designing ethical environments, that is, environments in which it is in the agent's best interest to learn ethical behaviours. The process first specifies the ethical knowledge associated with a moral value in terms that can be used in a reinforcement learning context, and then embeds this knowledge into the agent's learning environment. The resulting ethical environment incentivises the agent to learn an ethical behaviour while pursuing its own objective.
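The abstract does not detail how this embedding is done. The following is a minimal sketch, assuming the moral value has been specified as an extra reward signal that the environment designer adds, with some weight, on top of the agent's individual reward; all names (EthicalEnvWrapper, ethical_reward, w) are illustrative and not taken from the thesis.

import gymnasium as gym

class EthicalEnvWrapper(gym.Wrapper):
    """Illustrative sketch: wraps a base environment so the agent also
    receives an ethical reward encoding the specified moral value."""

    def __init__(self, env, ethical_reward, w=1.0):
        super().__init__(env)
        self.ethical_reward = ethical_reward  # function (obs, action) -> float
        self.w = w                            # ethical weight chosen by the designer
        self._last_obs = None

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self._last_obs = obs
        return obs, info

    def step(self, action):
        obs, r_individual, terminated, truncated, info = self.env.step(action)
        r_ethical = self.ethical_reward(self._last_obs, action)
        self._last_obs = obs
        # The agent optimises the combined reward, so behaving ethically is in
        # its best interest while it still pursues its individual objective.
        return obs, r_individual + self.w * r_ethical, terminated, truncated, info

Under this reading, any standard reinforcement learning algorithm can then be run unchanged on the wrapped environment.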
We further contribute to the state of the art with a novel algorithm that, by following our ethical environment design process, is formally guaranteed to produce ethical environments. In other words, the algorithm guarantees that it is in the agent's best interest to learn value-aligned behaviours.
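The abstract does not state how the formal guarantee is obtained. One hedged reading is that the ethical reward must be weighted strongly enough that an ethically optimal policy also maximises the combined reward. The toy function below computes such a minimal weight from the individual and ethical values of a finite set of candidate policies; it is an assumption-laden sketch, not the thesis algorithm.

def minimal_ethical_weight(policies, margin=1e-6):
    """Illustrative sketch: `policies` is a list of (v_individual, v_ethical)
    pairs, one per candidate policy. Returns a weight w such that an
    ethically optimal policy also maximises v_individual + w * v_ethical."""
    # Pick the ethically optimal policy; break ties by individual value.
    v_ind_star, v_eth_star = max(policies, key=lambda p: (p[1], p[0]))
    w = 0.0
    for v_ind, v_eth in policies:
        if v_eth < v_eth_star and v_ind > v_ind_star:
            # This policy earns more but is less ethical: weight the ethical
            # objective enough to make it strictly worse overall.
            w = max(w, (v_ind - v_ind_star) / (v_eth_star - v_eth))
    return w + margin  # small margin so the ethical policy is strictly preferred

For example, with candidate values [(10.0, 0.0), (7.0, 5.0)], any weight above 0.6 makes the ethical policy the agent's best choice.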
We illustrate our algorithm by applying it to a case study environment in which the agent is expected to learn to behave in alignment with the moral value of respect. In this case study, a conversational agent is in charge of conducting surveys, and we expect it to ask users questions respectfully while gathering as much information as possible. In the designed ethical environment, the empirical results confirm the theoretical ones: the agent learns an ethical behaviour while pursuing its individual objective.
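Purely to make the case study concrete, here is a hypothetical shape for a respect-related ethical reward that could be plugged into the wrapper sketched above; the action fields are invented for illustration and do not come from the thesis.

def respect_reward(obs, action):
    """Toy ethical reward: penalise disrespectful phrasings of survey
    questions, reward polite ones, and stay neutral otherwise."""
    phrasing = action.get("phrasing")
    if phrasing == "rude":
        return -1.0
    if phrasing == "polite":
        return 0.5
    return 0.0

# The individual objective (information gained per answered question) would come
# from the base environment's own reward; the wrapper adds w * respect_reward.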
