Event: Linguistics Circle Seminar
Date: Friday 18 November 2022
Time: 12:00 - 13:00
Venue: Online (via Zoom)
The "grounding problem" formulated by Harnad refers to the ability of an artificial system to establish non-arbitrary links between symbols (e.g. words or phrases) and non-symbolic features, such as arise from perception and experience. In NLP, a promising way to achieve this is via vision-language models. These are deep neural networks based on Transformer architectures, which are trained on combinations of visual and linguistic data. Such models achieve impressive results on multimodal tasks such as image retrieval based on text, or visual question answering. But do they really learn to ground linguistic expressions in vision?
For example, are they able to "see" past surface features of a descriptive caption (e.g. the objects it mentions) and also "understand" subtler linguistic features, such as the distinction between "on" and "next to", or distinctions in word order? After all, the picture corresponding to "the cat scratched the dog" is different from the one corresponding to "the dog scratched the cat" (but not from the one corresponding to "the dog was scratched by the cat").
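To make the word-order question concrete, the same kind of model can be probed by scoring one image against two captions that mention the same objects but reverse their roles. A toy sketch along these lines follows (the image path is hypothetical, and real evaluation suites of the kind discussed in the talk do this systematically over many curated image-caption pairs):

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical image of a cat scratching a dog, and two captions that
# mention the same objects but reverse who does what to whom.
image = Image.open("cat_scratching_dog.jpg")
captions = ["the cat scratched the dog", "the dog scratched the cat"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=1)[0]

# A model that grounds word order should clearly prefer the role-matching
# caption; one keying only on the objects mentioned will sit near 50/50.
for caption, p in zip(captions, probs.tolist()):
    print(f"{p:.3f}  {caption}")
```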
In this talk, I'll give an overview of these neural architectures and how they are trained, and then discuss some of our recent work on developing evaluation suites to probe these models' linguistic grounding abilities. In particular, I will examine some of the results obtained on syntax-oriented grounding tests, which show that these models still fall short of full grounding capabilities. I will conclude by discussing why this is the case, ending with a broader reflection on what grounding means, in light of ongoing research in my group on challenging problems for visually grounded language processing.