colab Potsdam | Towards Visual Dialogue: Lexical Knowledge for Situated Interaction

David Schlangen, Sina Zarrieß, Casey Kennington

(These were originally notes for an invited talk.)

Stealing an idea I got from a talk by Charles Sutton, this is the companion website to my talk at the Computational Linguistics Seminar Series at the ILLC, University of Amsterdam, September 26th 2017. Since you’re probably looking at your laptop anyway, this might give you something usefully related to look at…

Table of Contents / Main Points

What is Visual Dialog?
Or, what is it not? Using the system at visualdialog.org as a foil (and with no disrespect to their great work), we motivate a lexicon / knowledge repository for enabling interaction about present objects, one that can link words both to such objects and to other knowledge about types of objects.
Referential Interaction & Learning Likenesses
In which I describe the types of interaction from which we learn, namely interactions where a referring expression is observed to be used successfully. This data allows us to learn a simple model for each word that tells us whether a given object (represented by VGG-type features) is a likely member of the word’s extension. These judgements can be combined to form judgements for whole expressions, from which the quantifier “the” can pick the strongest. Negation is also shown to work. (And we can also describe this as being an interpretation function for our formulae which gives us values in $[0,1]$ instead of $\{0,1\}$.) But then we receive a call from the 1970s, complaining about fuzzy logic and semantics…
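A minimal sketch of the idea in Python, not the actual implementation: each word gets its own logistic classifier over object features, word scores are averaged into an expression score, “the” picks the highest-scoring object, and negation is $1 - p$. The three-dimensional toy features (standing in for VGG-type features), the hand-set weights, and the choice of the arithmetic mean for composition are all assumptions made for illustration.

```python
import math

# Hypothetical (weights, bias) per word; in practice these would be learned
# from successful referring expressions, over much larger feature vectors.
WORD_CLASSIFIERS = {
    "red": ([4.0, -2.0, 0.0], -1.0),
    "ball": ([0.0, 0.0, 5.0], -2.5),
}


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def word_score(word, features):
    """Graded judgement in [0, 1]: is this object in the word's extension?"""
    weights, bias = WORD_CLASSIFIERS[word]
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


def expression_score(tokens, features):
    """Average the per-word scores; 'not' flips the score of the next word."""
    scores, negated = [], False
    for token in tokens:
        if token == "not":
            negated = True
            continue
        score = word_score(token, features)
        scores.append(1.0 - score if negated else score)
        negated = False
    return sum(scores) / len(scores)


def the(tokens, scene):
    """The quantifier 'the': return the id of the best-scoring object."""
    return max(scene, key=lambda obj_id: expression_score(tokens, scene[obj_id]))


# Toy scene; feature order is (redness, greenness, roundness).
SCENE = {
    "obj1": [1.0, 0.0, 1.0],
    "obj2": [0.0, 1.0, 1.0],
    "obj3": [1.0, 0.0, 0.0],
}

print(the(["red", "ball"], SCENE))
print(the(["not", "red", "ball"], SCENE))
```

With these toy numbers, “red ball” resolves to the red round object and “not red ball” to the green round one.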
Inferring Conceptual Relations
The other dimension of the lexicon is relations between words / concepts. I show that we can try to derive those from the likeness classifiers, via a notion of “extensional similarity”. But we can mostly learn from cases where this goes wrong. We can also derive similarity notions from the interactions more directly, which I show to be somewhat similar to similarity judgements derived from large quantities of mono-modal text.
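One way to make “extensional similarity” concrete, sketched here under stated assumptions: apply two word classifiers to the same sample of objects, threshold their scores to get (approximate) extensions, and measure the overlap. The Jaccard overlap, the 0.5 threshold, and the toy classifiers and objects are all hypothetical choices for this illustration; the measure used in the papers may differ.

```python
def extension(classifier, objects, threshold=0.5):
    """Ids of the objects the classifier accepts at the given threshold."""
    return {obj_id for obj_id, feats in objects.items()
            if classifier(feats) >= threshold}


def extensional_similarity(clf_a, clf_b, objects, threshold=0.5):
    """Jaccard overlap of the two approximate extensions."""
    ext_a = extension(clf_a, objects, threshold)
    ext_b = extension(clf_b, objects, threshold)
    union = ext_a | ext_b
    if not union:
        return 0.0
    return len(ext_a & ext_b) / len(union)


# Toy objects; feature order is (dog-likeness, cat-likeness, car-likeness).
OBJECTS = {
    "o1": [0.9, 0.1, 0.0],
    "o2": [0.8, 0.2, 0.1],
    "o3": [0.1, 0.9, 0.0],
    "o4": [0.0, 0.1, 0.95],
}

# Stand-ins for learned word classifiers.
def dog(feats):
    return feats[0]

def cat(feats):
    return feats[1]

def animal(feats):
    return max(feats[0], feats[1])

def car(feats):
    return feats[2]


print(extensional_similarity(dog, animal, OBJECTS))
print(extensional_similarity(dog, cat, OBJECTS))
print(extensional_similarity(dog, car, OBJECTS))
```

In this toy setup “dog” and “animal” overlap substantially, while “dog” and “cat” come out exactly as dissimilar as “dog” and “car”. That is one way in which a purely extensional measure can diverge from conceptual relatedness.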
Zero-Shot Learning: Using Inferential Knowledge for Reference Resolution
Zero-shot learning is where the two kinds of knowledge come together. It’s pretty unexciting in this model: If you know that a wampimuk is a kind of rodent, but you don’t know what it looks like, you’re still best off picking the thing that looks like a rodent as referent. I briefly mention some other ways that we’ve integrated inferential knowledge for this task.
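The wampimuk case can be sketched as a simple back-off, again only as an illustration: if a word has no classifier of its own, follow what is known about it (here, a single “is a kind of” link) until a word with a classifier is reached, and resolve the reference with that classifier. The `HYPERNYMS` table, the toy classifiers, and the feature names are hypothetical.

```python
# Inferential knowledge: "X is a kind of Y" (hypothetical entries).
HYPERNYMS = {"wampimuk": "rodent", "rodent": "animal"}

# Stand-ins for learned likeness classifiers; there is none for "wampimuk".
CLASSIFIERS = {
    "rodent": lambda feats: feats["rodent_like"],
    "car": lambda feats: feats["car_like"],
}


def score(word, feats):
    """Score with the word's own classifier, or back off along hypernyms."""
    visited = set()
    while word not in CLASSIFIERS:
        if word in visited or word not in HYPERNYMS:
            raise KeyError(f"no classifier reachable for {word!r}")
        visited.add(word)
        word = HYPERNYMS[word]
    return CLASSIFIERS[word](feats)


def resolve(word, scene):
    """Pick the object in the scene that best matches the word."""
    return max(scene, key=lambda obj_id: score(word, scene[obj_id]))


SCENE = {
    "obj_a": {"rodent_like": 0.1, "car_like": 0.9},
    "obj_b": {"rodent_like": 0.8, "car_like": 0.0},
}

print(resolve("wampimuk", SCENE))
```

The never-seen “wampimuk” is resolved to the object that looks most like a rodent, which is all the inferential knowledge licenses here.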
Learning Syntax from Referential Interaction
I won’t actually be able to talk about this, because by now I will already have used up most of my time, if not already more than that, but I’m still listing this here because it actually looks like this might be a nice source of information for learning about the structure of referring expressions. Also, we haven’t really done that much on this yet.
Dialogue
We do have a simple visual chat bot for referential interaction, but unfortunately it’s not quite ready for being put online yet. So I’ll just talk about what’s still missing (in terms of actual capabilities, not just technically). Because that directs us nicely to the next chapter:
Proposal for a Visual Dialogue Challenge
In which I outline a data collection / challenge that we are currently setting up, which should produce data that is richer in dialogue phenomena while still being grounded in easily controlled images. (Link to page for that to follow.)
Conclusions
In which I argue / speculate that all this amounts to a causal theory of reference that links us to cavemen and -women; that meanings indeed ain’t in the head, but classifiers are (among other things); that we can build linguistic division of labour into this by modelling teachers; that Wittgenstein would probably not object; and that there is still a lot to do. I also may say something about whether all this is worth anything, even if it isn’t end-to-end™.

Our Relevant Papers

(Zarrieß and Schlangen 2017) (Zarrieß and Schlangen 2017) (Zarrieß and Schlangen 2017) (Zarrieß and Schlangen 2017) (Schlangen, Zarrieß, and Kennington 2016) (Manuvinakurike et al. 2016) (Zarrieß and Schlangen 2016) (Zarrieß and Schlangen 2016) (Schlangen 2016) (Kennington and Schlangen 2015)

Sina Zarrieß, and David Schlangen Deriving continous grounded meaning representations from referentially structured multimodal contexts Proceedings of EMNLP 2017 – Short Papers 2017 [PDF]

BibTeX

@inproceedings{Zarrieß-2017,
  author = {Zarrieß, Sina and Schlangen, David},
  booktitle = {Proceedings of EMNLP 2017 -- Short Papers},
  location = {Copenhagen},
  title = {{Deriving continous grounded meaning representations from referentially structured multimodal contexts}},
  year = {2017},
  topics = {},
  domains = {},
  approach = {},
  project = {}
}

Sina Zarrieß, and David Schlangen Obtaining referential word meanings from visual and distributional information: Experiments on object naming Proceedings of 55th annual meeting of the Association for Computational Linguistics (ACL) 2017 [PDF]

BibTeX

@inproceedings{Zarrieß-2017-1,
  author = {Zarrieß, Sina and Schlangen, David},
  booktitle = {Proceedings of 55th annual meeting of the Association for Computational Linguistics (ACL)},
  title = {{Obtaining referential word meanings from visual and distributional information: Experiments on object naming}},
  year = {2017},
  topics = {},
  domains = {},
  approach = {},
  project = {}
}

Sina Zarrieß, and David Schlangen Is this a Child, a Girl, or a Car? Exploring the Contribution of Distributional Similarity to Learning Referential Word Meanings Short Papers – Proceedings of the Annual Meeting of the European Chapter of the Association for Computational Linguistics (EACL) 2017 [PDF]

BibTeX

@inproceedings{Zarrieß-2017-2,
  author = {Zarrieß, Sina and Schlangen, David},
  booktitle = {Short Papers -- Proceedings of the Annual Meeting of the European Chapter of the Association for Computational Linguistics (EACL)},
  location = {Valencia, Spain},
  title = {{Is this a Child, a Girl, or a Car? Exploring the Contribution of Distributional Similarity to Learning Referential Word Meanings}},
  year = {2017},
  topics = {},
  domains = {},
  approach = {},
  project = {}
}

Sina Zarrieß, and David Schlangen Refer-iTTS: A System for Referring in Spoken Installments to Objects in Real-World Images Proceedings of INLG 2017 (demo papers) 2017 [PDF]

BibTeX

@inproceedings{Zarrieß-2017-3,
  author = {Zarrieß, Sina and Schlangen, David},
  booktitle = {Proceedings of INLG 2017 (demo papers)},
  title = {{Refer-iTTS: A System for Referring in Spoken Installments to Objects in Real-World Images}},
  year = {2017},
  topics = {},
  domains = {},
  approach = {},
  project = {}
}

Sina Zarrieß, and David Schlangen Easy Things First: Installments Improve Referring Expression Generation for Objects in Photographs Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016) 2016 [PDF]

BibTeX

@inproceedings{Zarrieß-2016,
  author = {Zarrieß, Sina and Schlangen, David},
  booktitle = {Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016)},
  location = {Berlin, Germany},
  title = {{Easy Things First: Installments Improve Referring Expression Generation for Objects in Photographs}},
  year = {2016},
  topics = {},
  domains = {},
  approach = {},
  project = {}
}

David Schlangen, Sina Zarrieß, and Casey Kennington Resolving References to Objects in Photographs using the Words-As-Classifiers Model Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2016 [PDF]

BibTeX

@inproceedings{Schlangen-2016,
  title = {Resolving References to Objects in Photographs using the Words-As-Classifiers Model},
  author = {Schlangen, David and Zarrieß, Sina and Kennington, Casey},
  booktitle = {Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  month = aug,
  year = {2016},
  address = {Berlin, Germany},
  publisher = {Association for Computational Linguistics},
  url = {https://aclanthology.org/P16-1115},
  doi = {10.18653/v1/P16-1115},
  pages = {1213--1223},
  topics = {},
  domains = {},
  approach = {},
  project = {}
}

Sina Zarrieß, and David Schlangen Towards Generating Colour Terms for Referents in Photographs: Prefer the Expected or the Unexpected? Proceedings of the 9th International Natural Language Generation conference 2016 [Abs] [PDF]

BibTeX

@inproceedings{Zarrieß-2016-4,
  author = {Zarrieß, Sina and Schlangen, David},
  booktitle = {Proceedings of the 9th International Natural Language Generation conference},
  location = {Edinburgh, UK},
  pages = {246--255},
  publisher = {Association for Computational Linguistics},
  title = {{Towards Generating Colour Terms for Referents in Photographs: Prefer the Expected or the Unexpected?}},
  year = {2016},
  topics = {},
  domains = {},
  approach = {},
  project = {}
}

Ramesh Manuvinakurike, Casey Kennington, David DeVault, and David Schlangen Real-Time Understanding of Complex Discriminative Scene Descriptions Proceedings of the 17th Annual SIGdial Meeting on Discourse and Dialogue 2016 [PDF]

BibTeX

@inproceedings{Manuvinakurike-2016-1,
  author = {Manuvinakurike, Ramesh and Kennington, Casey and DeVault, David and Schlangen, David},
  booktitle = {Proceedings of the 17th Annual SIGdial Meeting on Discourse and Dialogue},
  location = {Los Angeles, CA, USA},
  title = {{Real-Time Understanding of Complex Discriminative Scene Descriptions}},
  year = {2016},
  topics = {},
  domains = {},
  approach = {},
  project = {}
}

David Schlangen Grounding, Justification, Adaptation: Towards Machines That Mean What They Say Proceedings of the 20th Workshop on the Semantics and Pragmatics of Dialogue (JerSem) 2016 [PDF]

BibTeX

@inproceedings{Schlangen-2016-1,
  author = {Schlangen, David},
  booktitle = {Proceedings of the 20th Workshop on the Semantics and Pragmatics of Dialogue (JerSem)},
  location = {New Brunswick, United States},
  title = {{Grounding, Justification, Adaptation: Towards Machines That Mean What They Say}},
  year = {2016},
  topics = {},
  domains = {},
  approach = {},
  project = {}
}

Casey Kennington, and David Schlangen Simple Learning and Compositional Application of Perceptually Grounded Word Meanings for Incremental Reference Resolution Proceedings of the Conference for the Association for Computational Linguistics (ACL) 2015 [PDF]

BibTeX

@inproceedings{Kennington-2015-2,
  author = {Kennington, Casey and Schlangen, David},
  booktitle = {Proceedings of the Conference for the Association for Computational Linguistics (ACL)},
  location = {Beijing, China},
  pages = {292--301},
  publisher = {Association for Computational Linguistics},
  title = {{Simple Learning and Compositional Application of Perceptually Grounded Word Meanings for Incremental Reference Resolution}},
  year = {2015},
  topics = {},
  domains = {},
  approach = {},
  project = {}
}

Code

GitHub repo for ACL-2016 paper (resolving references to objects in photographs). Covers the basics of the application of the WAC model to the SAIAPR and MSCOCO datasets. Not fully up-to-date, unfortunately.