Playpen: An Environment for Exploring Learning From Dialogue Game Feedback

Horst, Nicola and Mazzaccara, Davide and Schmidt, Antonia and Sullivan, Michael and Momentè, Filippo and Franceschetti, Luca and Sadler, Philipp and Hakimov, Sherzod and Testoni, Alberto and Bernardi, Raffaella and Fernández, Raquel and Koller, Alexander and Lemon, Oliver and Schlangen, David and Giulianelli, Mario and Suglia, Alessandro

Interaction between learner and feedback-giver has come into focus recently for post-training of Large Language Models (LLMs), through the use of reward models that judge the appropriateness of a model’s response. In this paper, we investigate whether Dialogue Games (goal-directed and rule-governed activities driven predominantly by verbal actions) can also serve as a source of feedback signals for learning. We introduce Playpen, an environment for off- and online learning through Dialogue Game self-play, and investigate a representative set of post-training methods: supervised fine-tuning (SFT); direct alignment (DPO); and reinforcement learning with Group Relative Policy Optimization (GRPO). We experiment with post-training a small LLM (Llama-3.1-8B-Instruct), evaluating performance on unseen instances of training games as well as unseen games, and on standard benchmarks. We find that imitation learning through SFT improves performance on unseen instances, but negatively impacts other skills, while interactive learning with GRPO shows balanced improvements without loss of skills. We release the framework and the baseline training setups to foster research in this promising new direction of “learning in (synthetic) interaction”.

In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025
@inproceedings{Horst-2025,
  title = {Playpen: An Environment for Exploring Learning From Dialogue Game Feedback},
  author = {Horst, Nicola and Mazzaccara, Davide and Schmidt, Antonia and Sullivan, Michael and Moment{\`e}, Filippo and Franceschetti, Luca and Sadler, Philipp and Hakimov, Sherzod and Testoni, Alberto and Bernardi, Raffaella and Fern{\'a}ndez, Raquel and Koller, Alexander and Lemon, Oliver and Schlangen, David and Giulianelli, Mario and Suglia, Alessandro},
  editor = {Christodoulopoulos, Christos and Chakraborty, Tanmoy and Rose, Carolyn and Peng, Violet},
  booktitle = {Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
  month = nov,
  year = {2025},
  address = {Suzhou, China},
  publisher = {Association for Computational Linguistics},
  url = {https://aclanthology.org/2025.emnlp-main.1517/},
  doi = {10.18653/v1/2025.emnlp-main.1517},
  pages = {29842--29879},
  isbn = {979-8-89176-332-6}
}