From “Before” to “After”: Generating Natural Language Instructions from Image Pairs in a Simple Visual Domain

Rojowiec, Robin and Götze, Jana and Sadler, Philipp and Voigt, Henrik and Zarrieß, Sina and Schlangen, David

While certain types of instructions can be compactly expressed via images, there are situations where one might want to verbalise them, for example when directing someone. We investigate the task of Instruction Generation from Before/After Image Pairs, which is to derive from images an instruction for effecting the implied change. For this, we make use of prior work on instruction following in a visual environment. We take an existing dataset, the BLOCKS data collected by Bisk et al. (2016), and investigate whether it is suitable for training an instruction generator as well. We find that it is, and investigate several simple baselines, taking these from the related task of image captioning. Through a series of experiments that simplify the task (by making image processing easier or completely side-stepping it; and by creating template-based targeted instructions), we investigate areas for improvement. We find that captioning models get some way towards solving the task, but have some difficulty with it, and future improvements must lie in the way the change is detected in the instruction.

In Proceedings of the 13th International Conference on Natural Language Generation, 2020
@inproceedings{Rojowiec-2020,
  title = {From {``}Before{''} to {``}After{''}: Generating Natural Language Instructions from Image Pairs in a Simple Visual Domain},
  author = {Rojowiec, Robin and G{\"o}tze, Jana and Sadler, Philipp and Voigt, Henrik and Zarrie{\ss}, Sina and Schlangen, David},
  booktitle = {Proceedings of the 13th International Conference on Natural Language Generation},
  month = dec,
  year = {2020},
  address = {Dublin, Ireland},
  publisher = {Association for Computational Linguistics},
  url = {https://aclanthology.org/2020.inlg-1.38},
  pages = {316--326}
}