Abstract: The ‘symbol grounding’ problem concerns how models can establish non-arbitrary links between symbols (for example, words or phrases) and data from perception and experience. In recent years, several deep, transformer-based models have been proposed which purport to learn such relationships, following extensive pretraining on datasets consisting of images (or videos) paired with text. One goal of this lecture is to provide an overview of the state of the art in the field, with particular reference to available datasets of images and text, model architectures, and tasks. A key question that arises here is: What do these models learn to ground? We will see that there is a strong object-centric bias in much of the training data used for such models, and will discuss some research that aims to overcome this bias. We then turn to recent work on model analysis and evaluation, which examines the grounding capabilities of such models using a variety of techniques, including foil-based methods and ablation.
Bio: Albert Gatt is Associate Professor in the Natural Language Processing Group at the Department of Information and Computing Sciences, Utrecht University. He is also affiliated with the Institute of Linguistics and Language Technology, University of Malta. His main research interests are in Natural Language Generation and the interface between Vision and Language. He has also worked extensively on NLP for under-resourced languages.