[Figure: An example of a combined deictic / iconic gesture from our corpus, together with the time-aligned transcription.]
[Figure: A sketch of the system architecture for incrementally interpreting the speech / gesture ensembles.]

Abstract

When describing routes that are not in current view, a common strategy is to anchor the description in configurations of salient landmarks, complementing the verbal description by “placing” the non-visible landmarks in gesture space. Understanding such multimodal descriptions and later locating the landmarks in the real world is a challenging task for the hearer, who must interpret speech and gestures in parallel, fuse information from both modalities, build a mental representation of the description, and ground that representation to real-world landmarks. This makes it a good test case for modelling the interpretation and fusion of multimodal descriptions in situated dialogue. We model the hearer’s task of understanding multimodal spatial descriptions using a corpus of multimodal spatial descriptions that we collected, and investigate how deictic gestures can benefit the performance of a real-time multimodal system.

Methods

We collected a multimodal spatial description corpus comprising speech and hand motion data. Participants provided natural spatial scene descriptions with speech and abstract deictic gestures; the scenes were composed of simple geometric objects. While the language denotes object shape and visual properties (e.g., colour), the abstract deictic gestures “place” objects in gesture space to denote the spatial relations between objects. Only together with speech do these gestures receive defined meanings. Our preliminary analysis shows that the co-verbal deictic gestures in the corpus reflect the spatial configurations of the objects, and that there is variation both in how the gesture space is used and in the verbal descriptions.
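To make the structure of the data concrete, here is a minimal sketch (in Python, with invented field and class names; the actual corpus format may differ) of what a single recorded episode could look like: a time-aligned transcription, tracked hand positions from the gesture, and the described scene.

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class Word:
        token: str                            # spoken word
        start: float                          # onset time in seconds
        end: float                            # offset time in seconds

    @dataclass
    class HandSample:
        t: float                              # timestamp in seconds
        position: Tuple[float, float, float]  # tracked hand position (x, y, z)

    @dataclass
    class SceneObject:
        shape: str                            # e.g. "circle", "square"
        colour: str                           # e.g. "red", "blue"
        position: Tuple[float, float]         # 2D position in the described scene

    @dataclass
    class Episode:
        words: List[Word]                     # time-aligned transcription
        hand_track: List[HandSample]          # abstract deictic gesture trajectory
        scene: List[SceneObject]              # the scene that was described

    # A toy episode: "the red circle ...", with the hand "placing" the object
    # at a location in gesture space while it is being mentioned.
    episode = Episode(
        words=[Word("the", 0.0, 0.2), Word("red", 0.2, 0.5), Word("circle", 0.5, 0.9)],
        hand_track=[HandSample(0.4, (0.10, 1.20, 0.35)), HandSample(0.8, (0.11, 1.21, 0.34))],
        scene=[SceneObject("circle", "red", (0.3, 0.7)), SceneObject("square", "blue", (0.3, 0.2))],
    )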

We presented a first attempt at an end-to-end model of the task of understanding multimodal spatial descriptions, where the understanding can be tested by applying it in a real-world discrimination task (scene retrieval, as shown above). We explored different ways of representing verbal content, from uncompressed word sequences (a verbatim representation), through pre-specified property symbols, to learning a set of “concepts” automatically.
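As a rough illustration (with invented symbols and a toy scoring function, not the actual models from the papers), the three levels of representation for one description, and how such a representation could drive the scene-retrieval task, might look as follows:

    # One spoken description of an object, in three increasingly abstract forms.
    utterance = "there is a small red circle on the left"

    # 1) Verbatim: the uncompressed word sequence itself.
    verbatim = utterance.split()

    # 2) Pre-specified property symbols: words mapped to a fixed symbol inventory.
    property_symbols = {"size": "small", "colour": "red", "shape": "circle"}

    # 3) Learned "concepts": e.g. cluster indices induced from data instead of
    #    hand-specified symbols (here just illustrative integers).
    learned_concepts = [17, 3, 42]

    # Toy version of the discrimination task: score candidate scenes by how many
    # of the described properties their best-matching object shares.
    def score_scene(scene, description):
        return max(
            sum(obj.get(attr) == val for attr, val in description.items())
            for obj in scene
        )

    candidate_scenes = [
        [{"size": "small", "colour": "red", "shape": "circle"}],
        [{"size": "large", "colour": "blue", "shape": "square"}],
    ]
    best = max(range(len(candidate_scenes)),
               key=lambda i: score_scene(candidate_scenes[i], property_symbols))
    print("retrieved scene:", best)   # -> 0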

We also built a real-time system that incrementally interprets multimodal spatial descriptions. We evaluated the separate and joint contributions of natural language and deictic gestures, both in terms of overall system performance and at the incremental level. The results show that deictic gestures not only improve overall system performance, but also lead to earlier final correct interpretations. Being able to build and apply representations incrementally will be useful in more dialogical settings, where it can enable immediate clarification.
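The following is a minimal sketch of the incremental fusion idea, under the assumption of simple hand-crafted evidence functions (the actual system uses learned, perceptually grounded models): language increments update how well each candidate scene matches the described properties, gesture increments check whether relative placements in gesture space match the scene layout, and a current best hypothesis is available after every increment.

    def language_evidence(scene, words_so_far):
        """Very rough proxy: count described property words that occur in the scene."""
        vocab = {w for obj in scene for w in obj["properties"]}
        return sum(w in vocab for w in words_so_far)

    def gesture_evidence(scene, placements):
        """Check whether the relative placement of the last two gestured positions
        matches the relative placement of some object pair in the scene."""
        if len(placements) < 2:
            return 0
        (x1, _), (x2, _) = placements[-2], placements[-1]
        gestured_right_of = x2 > x1
        return any(
            (b["pos"][0] > a["pos"][0]) == gestured_right_of
            for a in scene for b in scene if a is not b
        )

    def interpret_incrementally(increments, candidate_scenes):
        words, placements = [], []
        for inc in increments:                       # word- or gesture-level updates
            if inc["type"] == "word":
                words.append(inc["value"])
            else:                                    # a "placing" gesture in gesture space
                placements.append(inc["value"])
            scores = [language_evidence(s, words) + gesture_evidence(s, placements)
                      for s in candidate_scenes]
            yield scores.index(max(scores))          # current best hypothesis

    scenes = [
        [{"properties": {"red", "circle"}, "pos": (0.2, 0.5)},
         {"properties": {"blue", "square"}, "pos": (0.8, 0.5)}],
        [{"properties": {"green", "triangle"}, "pos": (0.5, 0.5)}],
    ]
    increments = [
        {"type": "word", "value": "a"},
        {"type": "word", "value": "red"},
        {"type": "gesture", "value": (0.2, 0.5)},
        {"type": "word", "value": "circle"},
        {"type": "gesture", "value": (0.8, 0.5)},
    ]
    for step, best in enumerate(interpret_incrementally(increments, scenes), 1):
        print(f"after increment {step}: best scene = {best}")

In a setup like this, a correct hypothesis can stabilise before the description is complete, which is what the incremental evaluation measures.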

Publications


  1. Ting Han and David Schlangen. A Corpus of Natural Multimodal Spatial Scene Descriptions. In: The 11th Language Resources and Evaluation Conference (LREC), Miyazaki, 2018. [PDF]
    BibTeX:
    @inproceedings{Han-2018-1,
      author = {Han, Ting and Schlangen, David},
      booktitle = {The 11th edition of the Language Resources and Evaluation Conference (LREC)},
      location = {Miyazaki},
      title = {{A Corpus of Natural Multimodal Spatial Scene Descriptions}},
      year = {2018}
    }
  2. Ting Han, Casey Kennington, and David Schlangen. Placing Objects in Gesture Space: Toward Real-Time Understanding of Spatial Descriptions. In: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), New Orleans, 2018. [PDF]
    BibTeX:
    @inproceedings{Han-2018-2,
      author = {Han, Ting and Kennington, Casey and Schlangen, David},
      booktitle = {Proceedings of the thirty-second AAAI conference on artificial intelligence (AAAI18)},
      location = {New Orleans},
      publisher = {The Association for the Advancement of Artificial Intelligence},
      title = {{Placing Objects in Gesture Space: Toward Real-Time Understanding of Spatial Descriptions}},
      year = {2018}
    }
  3. Ting Han, Casey Kennington, and David Schlangen. Building and Applying Perceptually-Grounded Representations of Multimodal Scene Descriptions. In: Proceedings of the 19th SemDial Workshop on the Semantics and Pragmatics of Dialogue (goDIAL), Gothenburg, Sweden, 2015, pp. 58–66. [PDF]
    BibTeX:
    @inproceedings{Han-2015,
      author = {Han, Ting and Kennington, Casey and Schlangen, David},
      booktitle = {Proceedings of the 19th SemDial Workshop on the Semantics and Pragmatics of Dialogue (goDIAL)},
      issn = {2308-2275},
      location = {Gothenburg, Sweden},
      pages = {58--66},
      title = {{Building and Applying Perceptually-Grounded Representations of Multimodal Scene Descriptions}},
      year = {2015}
    }
  4. Ting Han, Spyridon Kousidis, and David Schlangen. Towards Automatic Understanding of ‘Virtual Pointing’ in Interaction. In: Proceedings of the 18th SemDial Workshop on the Semantics and Pragmatics of Dialogue (DialWatt), Posters, 2014, pp. 188–190. [PDF]
    BibTeX:
    @inproceedings{Han-2014,
      author = {Han, Ting and Kousidis, Spyridon and Schlangen, David},
      booktitle = {Proceedings of the 18th SemDial Workshop on the Semantics and Pragmatics of Dialogue (DialWatt), Posters},
      issn = {2308-2275},
      pages = {188--190},
      title = {{Towards Automatic Understanding of `Virtual Pointing' in Interaction}},
      year = {2014}
    }

Data