Reading Times Predict the Quality of Generated Text Above and Beyond Human Ratings

Zarrieß, Sina and Loth, Sebastian and Schlangen, David

Typically, human evaluation of NLG output is based on user ratings. We collected ratings and reading time data in a simple, low-cost experimental paradigm for text generation. Participants were presented with corpus texts, automatically linearised texts, and texts containing predicted referring expressions and automatic linearisation. We demonstrate that the reading time metrics outperform the ratings in classifying texts according to their quality. Regression analyses showed that self-reported ratings discriminated poorly between the kinds of manipulation, especially between defects in word order and text coherence. In contrast, a combination of objective measures from the low-cost mouse-contingent reading paradigm provided very high classification accuracy and thus greater insight into the actual quality of an automatically generated text.

In Proceedings of the 15th European Workshop on Natural Language Generation, 2015
@inproceedings{Zarrieß-2015,
  author = {Zarrieß, Sina and Loth, Sebastian and Schlangen, David},
  editor = {Belz, Anya and Gatt, Albert and Portet, François and Purver, Matthew},
  booktitle = {Proceedings of the 15th European Workshop on Natural Language Generation},
  isbn = {978-1-941643-78-5},
  location = {Brighton, Sussex, United Kingdom},
  pages = {38--47},
  publisher = {The Association for Computational Linguistics},
  title = {{Reading Times Predict the Quality of Generated Text Above and Beyond Human Ratings}},
  year = {2015}
}