Document: Dense Rewards and Continual Reinforcement Learning for Task-oriented Dialogue Policies

Title: Dense Rewards and Continual Reinforcement Learning for Task-oriented Dialogue Policies
Bookmark URL: https://docserv.uni-duesseldorf.de/servlets/DocumentServlet?id=65630
URN (NBN): urn:nbn:de:hbz:061-20240429-082103-6
Collection: Dissertations
Language: English
Document type: Academic theses » Dissertation
Media type: Text
Author: Geishauser, Christian [Author]
Files: Adobe PDF, 9.43 MB in one file (files dated 25.04.2024 / modified 25.04.2024)
Contributors: Prof. Dr. Gasic, Milica [Supervisor]
Prof. Dr. Hakkani-Tür, Dilek [Examiner]
Dewey Decimal Classification: 000 Computer science, information, general works » 004 Data processing; computer science
Description: Humans continue learning throughout their lifetime in order to adapt to a changing world and to meet a growing number of challenges. When faced with a particular task, they ask questions at the right moments to gather information about it and to resolve uncertainty about misunderstandings before actually solving the task.
Task-oriented dialogue systems center on fulfilling a user's goal or task during a conversation, where the goal is restricted to a certain fixed scope of possibilities such as travel planning or schedule organization. Unlike humans, these systems lack the ability to learn continually, and their training neglects the important skill of information gathering and uncertainty reduction.
Nevertheless, the potential range of tasks a dialogue system can assist with is vast, owing to the expansive nature of human communication. Consequently, the scope of operation inevitably expands and circumstances evolve, which necessitates the human-like ability of continual learning. In addition, the dialogue policy, the decision-making component of the system that selects the response to the user, is trained in a trial-and-error process to maximize goal fulfillment. This training procedure is slow, since the only feedback, goal fulfillment, is obtained at the end of a conversation.
The main contributions of this thesis are as follows. Firstly, we propose an additional feedback signal for learning called "information gain" that is provided in every turn of the conversation and thus increases sample efficiency. Information gain encourages the dialogue policy to gather information about the user goal and to reduce uncertainty in its understanding, an essential step preceding goal fulfillment. Our experiments with different tasks and noise settings show that additionally using information gain leads to faster learning and a better final policy.
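The abstract does not spell out how the per-turn signal is computed. One simple way to realize such a reward, assuming the system maintains a probability distribution (belief) over candidate user-goal values, is the reduction in Shannon entropy of that belief after each turn. This is a minimal sketch under that assumption; the slot values and probabilities below are hypothetical:

```python
import math

def entropy(dist):
    """Shannon entropy (in nats) of a discrete distribution given as {value: prob}."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def information_gain_reward(belief_before, belief_after):
    """Per-turn reward: how much the system's uncertainty about the
    user goal dropped between the previous turn and the current one."""
    return entropy(belief_before) - entropy(belief_after)

# Before asking a clarifying question, the system is maximally unsure
# which area of town the user wants (hypothetical slot).
before = {"north": 0.25, "south": 0.25, "east": 0.25, "west": 0.25}
# After the user's answer, the belief concentrates on one value.
after = {"north": 0.90, "south": 0.05, "east": 0.03, "west": 0.02}

reward = information_gain_reward(before, after)  # positive: uncertainty was reduced
```

Because the reward is available at every turn rather than only at the end of the dialogue, a clarifying question that sharpens the belief is reinforced immediately.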
Secondly, we are the first to introduce continual reinforcement learning for dialogue policies and propose a novel dynamic architecture called the "dynamic dialogue policy transformer" (DDPT). DDPT builds on the Transformer and a pre-trained language model and is further equipped with a domain gate and hard attention to allow dynamic inputs and outputs, forward transfer, and handling of many possible tasks. In a continual learning setup where tasks are introduced sequentially over time, DDPT achieves significant forward transfer and robustness against forgetting, while requiring no growth in the number of neural network parameters.

Thirdly, we propose "realistic environments for continual reinforcement learning of dialogue policies" (RECORD), a more general and controllable continual learning setup designed to model the most important challenges in continual dialogue policy learning. Furthermore, we propose using the lifetime return and meta-gradient reinforcement learning for dialogue policy optimization, yielding more robust learning and better adaptation during continual learning. We test multiple configurations of RECORD with different user behaviors and algorithms, where our proposals of lifetime return and meta-gradient reinforcement learning lead to consistent improvements.
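For illustration, the lifetime return can be contrasted with the usual episodic return: rather than discounting rewards within each episode separately, it discounts a single reward stream across the agent's entire lifetime, so decisions in early tasks are credited for their effect on later ones. This is a minimal sketch under that reading; the function names and reward values are hypothetical, not taken from the thesis:

```python
def episodic_returns(rewards_per_episode, gamma=0.99):
    """Standard objective: each episode's discounted return, computed independently,
    so nothing learned in one episode is credited for payoff in the next."""
    return [sum(gamma**t * r for t, r in enumerate(ep)) for ep in rewards_per_episode]

def lifetime_return(rewards_per_episode, gamma=0.99):
    """Lifetime objective: one discounted sum over the whole experience stream,
    ignoring episode boundaries."""
    flat = [r for ep in rewards_per_episode for r in ep]
    return sum(gamma**t * r for t, r in enumerate(flat))

# Two short hypothetical episodes: a reward early in the first
# and a reward late in the second.
episodes = [[1.0, 0.0], [0.0, 1.0]]
per_episode = episodic_returns(episodes)   # each episode valued in isolation
over_lifetime = lifetime_return(episodes)  # second episode's reward is discounted
                                           # relative to the start of the lifetime
```

In a meta-gradient setup, hyperparameters such as the discount factor could then be tuned by gradient descent on the lifetime objective itself.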
Our results on information gain warrant a more widespread application to any task where information acquisition and uncertainty reduction play a significant role, such as general information retrieval through dialogue. Furthermore, our contributions on continual learning pave the way for future advances that move the field of dialogue policies from static to dynamic learning. In that realm, the versatility of RECORD lays the foundation for training and testing the continual learning abilities of any task-oriented dialogue system.
License: This work is licensed under a Creative Commons Attribution 4.0 International License
Faculty / Institution: Mathematisch-Naturwissenschaftliche Fakultät » WE Informatik
Document created on: 29.04.2024
Files modified on: 29.04.2024
Doctoral application on: 30.10.2023
Date of doctorate: 23.04.2024