Figure: A multimodal reference to a church among several churches. The target referent is the second from the left.

### Abstract

Verbal utterances are not the only way of expressing communicative intentions: when something is difficult to describe in words (e.g., the exact shape of a house), speakers often intuitively resort to a range of non-verbal resources, such as hand-drawn sketches, in which iconic elements are deployed to achieve communicative success. In such cases, verbal utterances and sketches together form multimodal referring expressions (REs) that enable a recipient to identify target objects. We are interested in enabling robots and virtual agents to generate such multimodal referring expressions (M-REG) and thereby facilitate human-agent communication.

### Methods

We collected a multimodal object description corpus [2] by augmenting the hand-drawn sketch and image pairs in SketchyDB with verbal descriptions. This corpus makes it possible to investigate the interplay between iconic (sketch) and symbolic (natural language) semantics in object descriptions. Using a model of grounded language semantics and a model of sketch-to-image mapping, we evaluated the joint contributions of language and sketches in an image retrieval task. We show that adding even very reduced iconic information to a verbal image description improves the performance of the image retrieval model.
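The joint evaluation described above can be thought of as late fusion: each candidate image is scored separately against the verbal description and against the sketch, and the two scores are combined before ranking. The sketch below illustrates this idea with a simple weighted-sum combination over toy embedding vectors; the function names, the embeddings, and the interpolation weight `lam` are illustrative assumptions, not the actual models used in the study.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(text_emb, sketch_emb, candidates, lam=0.5):
    """Rank candidate images by a weighted sum of verbal and sketch similarity.

    candidates maps an image name to a pair of (hypothetical) embeddings:
    one for matching against the verbal description, one for the sketch.
    """
    scored = []
    for name, (img_text_emb, img_sketch_emb) in candidates.items():
        score = (lam * cosine(text_emb, img_text_emb)
                 + (1 - lam) * cosine(sketch_emb, img_sketch_emb))
        scored.append((score, name))
    return [name for score, name in sorted(scored, reverse=True)]

# Toy example: both images match the verbal description equally well,
# so the sketch provides the disambiguating (iconic) information.
ranking = retrieve(
    text_emb=[1, 0],
    sketch_emb=[1, 0],
    candidates={
        "church_1": ([1, 0], [0, 1]),
        "church_2": ([1, 0], [1, 0]),
    },
)
# → ['church_2', 'church_1']
```

This is the intuition behind the reported result: even when the verbal description alone is ambiguous between candidates, a partial sketch can shift the combined score toward the intended referent.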

With the collected data, we trained two M-REG models and evaluated them in unimodal and multimodal object identification tasks conducted via crowdsourcing experiments. The results show that partial hand-drawn sketches improve the effectiveness of verbal REs. Interestingly, the language generation system that performs worse in the unimodal condition, because it produces more ambiguous REs, leads to better results in the multimodal condition.

### Publications

1. Learning to describe multimodally from parallel unimodal data? A pilot study on verbal and sketched object descriptions. In Proceedings of the 22nd Workshop on the Semantics and Pragmatics of Dialogue (AixDial), 2018. [PDF]
BibTeX
@inproceedings{Han-2018,
  author    = {Han, Ting and Zarrieß, Sina and Komatani, Kazunori and Schlangen, David},
  booktitle = {Proceedings of the 22nd Workshop on the Semantics and Pragmatics of Dialogue (AixDial)},
  location  = {Aix-en-Provence (France)},
  title     = {{Learning to describe multimodally from parallel unimodal data? A pilot study on verbal and sketched object descriptions}},
  year      = {2018}
}

2. Natural Language Informs the Interpretation of Iconic Gestures: A Computational Approach. In The 8th International Joint Conference on Natural Language Processing, Proceedings of the Conference, Vol. 2: Short Papers, 2017. [PDF]
BibTeX
@inproceedings{Han-2017-3,
  author    = {Han, Ting and Hough, Julian and Schlangen, David},
  booktitle = {The 8th International Joint Conference on Natural Language Processing. Proceedings of the Conference. Vol. 2: Short Papers},
  isbn      = {978-1-948087-01-8},
  location  = {Taipei},
  pages     = {134--139},
  title     = {{Natural Language Informs the Interpretation of Iconic Gestures. A Computational Approach}},
  year      = {2017}
}

3. Draw and Tell: Multimodal Descriptions Outperform Verbal- or Sketch-Only Descriptions in an Image Retrieval Task. In The 8th International Joint Conference on Natural Language Processing, Proceedings of the Conference, Vol. 2: Short Papers, 2017. [PDF]
BibTeX
@inproceedings{Han-2017-4,
  author    = {Han, Ting and Schlangen, David},
  booktitle = {The 8th International Joint Conference on Natural Language Processing. Proceedings of the Conference. Vol. 2: Short Papers},
  isbn      = {978-1-948087-01-8},
  location  = {Taipei},
  pages     = {361--365},
  title     = {{Draw and Tell. Multimodal Descriptions Outperform Verbal- or Sketch-Only Descriptions in an Image Retrieval Task}},
  year      = {2017}
}
